rm(list=ls())
library(scales)
# Draw a confusion matrix and its summary statistics.
# `cm` is the object returned by caret::confusionMatrix(), with "failure"
# as the positive class; its $table, $byClass and $overall slots are used.
draw_confusion_matrix <- function(cm) {
  layout(matrix(c(1, 1, 2)))
  par(mar = c(2, 2, 2, 2))
  plot(c(100, 345), c(300, 450), type = "n", xlab = "", ylab = "", xaxt = 'n', yaxt = 'n')
  title('CONFUSION MATRIX', cex.main = 2)
  # draw the four cells of the matrix
  rect(150, 430, 240, 370, col = scales::alpha('blue', 1))
  text(195, 435, 'Failure', cex = 1.2)
  rect(250, 430, 340, 370, col = scales::alpha('orange', 1))
  text(295, 435, 'Success', cex = 1.2)
  text(125, 370, 'Actual class', cex = 1.3, srt = 90, font = 2)
  text(245, 450, 'Predicted class', cex = 1.3, font = 2)
  rect(150, 305, 240, 365, col = scales::alpha('orange', 1))
  rect(250, 305, 340, 365, col = scales::alpha('blue', 1))
  text(140, 400, 'Failure', cex = 1.2, srt = 90)
  text(140, 335, 'Success', cex = 1.2, srt = 90)
  # fill in the counts (column-major order of cm$table: TP, FN, FP, TN)
  res <- as.numeric(cm$table)
  text(195, 400, paste0("TP\n\n", res[1]), cex = 1.6, font = 2, col = 'white')
  text(195, 335, paste0("FP\n\n", res[3]), cex = 1.6, font = 2, col = 'white')
  text(295, 400, paste0("FN\n\n", res[2]), cex = 1.6, font = 2, col = 'white')
  text(295, 335, paste0("TN\n\n", res[4]), cex = 1.6, font = 2, col = 'white')
  # second panel: the derived performance measures
  plot(c(100, 0), c(100, 0), type = "n", xlab = "", ylab = "", main = "DETAILS", xaxt = 'n', yaxt = 'n')
  text(5, 75, "Sensitivity", cex = 1.2, font = 2)
  text(5, 35, round(as.numeric(cm$byClass[1]), 4), cex = 1.2)
  text(20, 75, "Specificity", cex = 1.2, font = 2)
  text(20, 35, round(as.numeric(cm$byClass[2]), 4), cex = 1.2)
  text(35, 75, "Precision", cex = 1.2, font = 2)
  text(35, 35, round(as.numeric(cm$byClass[5]), 4), cex = 1.2)
  text(52, 75, "Negative Predictive\n value", cex = 1.2, font = 2)
  text(52, 35, round(as.numeric(cm$byClass[4]), 4), cex = 1.2)
  # overall accuracy and F1
  text(70, 75, "Accuracy", cex = 1.2, font = 2)
  text(70, 35, round(as.numeric(cm$overall[1]), 4), cex = 1.2)
  text(90, 75, "F1", cex = 1.2, font = 2)
  text(90, 35, round(as.numeric(cm$byClass[7]), 4), cex = 1.2)
}
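For reference, a minimal usage sketch. It assumes the caret package, whose `confusionMatrix()` object provides the `$table`, `$byClass` and `$overall` slots read by the function above; the toy factors are purely illustrative.

```r
library(caret)

# Toy predictions and ground truth with "failure" as the first (positive) level
pred  <- factor(c("failure", "success", "success", "failure", "success"),
                levels = c("failure", "success"))
truth <- factor(c("failure", "success", "failure", "success", "success"),
                levels = c("failure", "success"))

cm <- confusionMatrix(pred, truth)  # positive class defaults to the first level
draw_confusion_matrix(cm)
```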
In this report, we analyze data about simulation crashes in climate models. Simulations using IPCC (Intergovernmental Panel on Climate Change)-class climate models are prone to failure or crash for a variety of reasons, ranging from numerical instability to particular climate phenomena. Since this class of models is extremely complex, detecting the cause of an error is infeasible in most cases. Our intent is to use machine learning techniques to understand the relation between the probability of a crash and the POP2 parameters, since this would allow us to improve our models. The data set under consideration contains a series of simulation crashes within the Parallel Ocean Program (POP2) component of the Community Climate System Model (CCSM4). It is available at http://archive.ics.uci.edu/ml/datasets/Climate+Model+Simulation+Crashes and is presented in [1].
In this study, the perturbed ocean model parameters were selected by the POP2 model developers. These parameters and their uncertainty ranges are summarized in Table 1.1, reproduced from [1]. As the table shows, the model parameters are represented using standard uniform or log-uniform probability distributions, normalized to \([0, 1]\) using their ranges (low and high values) and scales (linear or logarithmic).
| # | Parameter | [low, default, high] | Scale | Description |
|---|---|---|---|---|
| 1 | vconst_corr | \([0.3, 0.6, 1.2]\times 10^7\) | linear | variable viscosity parameter (vconst 1, vconst 6) |
| 2 | vconst_2 | \([0.25, 0.5, 2.0]\) | logarithmic | variable viscosity parameter |
| 3 | vconst_3 | \([0.16, 0.16, 0.2]\) | linear | variable viscosity parameter |
| 4 | vconst_4 | \([0.5, 2.0, 10.0]\times 10^{−8}\) | logarithmic | variable viscosity parameter |
| 5 | vconst_5 | \([2, 3, 5]\) | linear | variable viscosity parameter |
| 6 | vconst_7 | \([30.0, 45.0, 60.0]\) | linear | variable viscosity parameter |
| 7 | ah_corr | \([2.0, 3.0, 4.0]\times 10^7\) | linear | diffusion coefficient for Redi mixing (ah) and background horizontal diffusivity within the surface boundary layer (ah_bkg_srfbl) |
| 8 | ah_bolus | \([2.0, 3.0, 4.0]\times 10^7\) | linear | diffusion coefficient for bolus mixing |
| 9 | slm_corr | \([0.05,0.3,0.3]\) | logarithmic | maximum slope for bolus (slm b) and Redi terms (slm r) |
| 10 | efficiency_factor | \([0.05, 0.07, 0.1]\) | linear | efficiency factor for submesoscale eddies |
| 11 | tidal_mix_max | \([25.0, 100.0, 200.0]\) | logarithmic | tidal mixing threshold |
| 12 | vertical_decay_scale | \([2.5, 5.0, 20.0]\times 10^4\) | logarithmic | vertical decay scale for tide induced turbulence |
| 13 | convect_corr | \([1.0, 10.0, 50.0]\times 10^3\) | logarithmic | tracer (convect diff) and momentum (convect visc) mixing coefficients in diffusion option |
| 14 | bckgrnd_vdc1 | \([0.032, 0.16, 0.8]\) | logarithmic | base background vertical diffusivity |
| 15 | bckgrnd_vdc_ban | \([0.5, 1.0, 1.0]\) | linear | Banda Sea diffusivity |
| 16 | bckgrnd_vdc_eq | \([0.01, 0.01, 0.5]\) | logarithmic | equatorial diffusivity |
| 17 | bckgrnd_vdc_psim | \([0.1, 0.13, 0.5]\) | logarithmic | maximum PSI induced diffusivity |
| 18 | Prandtl | \([4.0, 10.0, 20.0]\) | logarithmic | ratio of background vertical viscosity and diffusivity |
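To make the normalization concrete, here is a minimal sketch of how a normalized value \(x \in [0, 1]\) can be mapped back to physical units from its range and scale. The helper name `denormalize` is our own, not part of the original study.

```r
# Map a normalized parameter value x in [0, 1] back to physical units.
denormalize <- function(x, low, high, scale = c("linear", "logarithmic")) {
  scale <- match.arg(scale)
  if (scale == "linear") {
    low + x * (high - low)   # uniform on [low, high]
  } else {
    low * (high / low)^x     # uniform on the logarithm of the range
  }
}

denormalize(0.5, 2.0e7, 4.0e7, "linear")   # midpoint of the ah_corr range: 3e7
denormalize(0.5, 25, 200, "logarithmic")   # geometric midpoint of tidal_mix_max: ~70.7
```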
The data set has 540 rows, divided into 3 ensemble studies of 180 observations each, and 21 columns. Each row refers to a specific simulation; the columns contain the ensemble study associated with the simulation, the registered values of the normalized POP2 parameters (introduced in Table 1.1), and a variable “outcome” indicating whether a failure occurred. There are no null/missing values in the data set. We now report the main statistical measures for each attribute:
set.seed(1)
climate=read.csv("/Users/jo961/Desktop/tesina_data_spaces/pop_failures1.csv",sep=";")
climate$outcome=as.factor(ifelse(climate$outcome=="0","failure",ifelse(climate$outcome=="1","success",NA)))
climate$Study=as.factor(climate$Study)
summary(climate)
Study Run vconst_corr vconst_2
1:180 Min. : 1.00 Min. :0.000414 Min. :0.001922
2:180 1st Qu.: 45.75 1st Qu.:0.249650 1st Qu.:0.251597
3:180 Median : 90.50 Median :0.499998 Median :0.499595
Mean : 90.50 Mean :0.500026 Mean :0.500097
3rd Qu.:135.25 3rd Qu.:0.750042 3rd Qu.:0.750011
Max. :180.00 Max. :0.999194 Max. :0.998815
vconst_3 vconst_4 vconst_5 vconst_7
Min. :0.001181 Min. :0.001972 Min. :0.0008584 Min. :0.0004755
1st Qu.:0.251540 1st Qu.:0.250158 1st Qu.:0.2506297 1st Qu.:0.2513251
Median :0.500104 Median :0.500456 Median :0.5009033 Median :0.4991738
Mean :0.500027 Mean :0.500119 Mean :0.5000014 Mean :0.4999131
3rd Qu.:0.749180 3rd Qu.:0.750348 3rd Qu.:0.7489880 3rd Qu.:0.7481662
Max. :0.998263 Max. :0.997673 Max. :0.9989439 Max. :0.9971417
ah_corr ah_bolus slm_corr efficiency_factor
Min. :0.00459 Min. :0.0002956 Min. :0.003231 Min. :0.002015
1st Qu.:0.25305 1st Qu.:0.2504025 1st Qu.:0.252661 1st Qu.:0.250758
Median :0.49907 Median :0.5000742 Median :0.500295 Median :0.500393
Mean :0.50006 Mean :0.5000755 Mean :0.500044 Mean :0.500111
3rd Qu.:0.75011 3rd Qu.:0.7490912 3rd Qu.:0.748605 3rd Qu.:0.749447
Max. :0.99893 Max. :0.9985065 Max. :0.997172 Max. :0.999536
tidal_mix_max vertical_decay_scale convect_corr bckgrnd_vdc1
Min. :0.000419 Min. :0.001188 Min. :0.001312 Min. :0.002509
1st Qu.:0.251676 1st Qu.:0.249669 1st Qu.:0.249988 1st Qu.:0.249586
Median :0.500322 Median :0.500151 Median :0.500625 Median :0.499080
Mean :0.499984 Mean :0.500032 Mean :0.499933 Mean :0.499944
3rd Qu.:0.749346 3rd Qu.:0.749164 3rd Qu.:0.749569 3rd Qu.:0.750012
Max. :0.999942 Max. :0.997718 Max. :0.997518 Max. :0.999795
bckgrnd_vdc_ban bckgrnd_vdc_eq bckgrnd_vdc_psim Prandtl
Min. :0.000732 Min. :0.0008907 Min. :0.0002194 Min. :0.0002626
1st Qu.:0.249974 1st Qu.:0.2504118 1st Qu.:0.2527390 1st Qu.:0.2497233
Median :0.499959 Median :0.5003844 Median :0.4989547 Median :0.4994308
Mean :0.499945 Mean :0.5000444 Mean :0.5000197 Mean :0.5000212
3rd Qu.:0.747978 3rd Qu.:0.7492564 3rd Qu.:0.7485386 3rd Qu.:0.7497920
Max. :0.999155 Max. :0.9972647 Max. :0.9993059 Max. :0.9996550
outcome
failure: 46
success:494
The summary above shows that all POP2 parameters share approximately the same values for the main quantiles.
In the following, we display some graphs that help us better understand our data.
The variable of interest is outcome, because we want to predict whether a failure will occur. From Figure 1.1, we can see that the data set is unbalanced: the failure class is underrepresented.
library(plotly)
x=c("outcome")
Success=sum(climate$outcome=="success")/(nrow(climate))
Failure=sum(climate$outcome=="failure")/(nrow(climate))
dataplot <- data.frame(x, Success, Failure)
fig <- plot_ly(dataplot, x = ~x, y = ~Success, type = 'bar', name = 'Success', text = paste0(round(100 * Success, 2), '%'), textposition = 'auto',
marker = list(color=alpha('green', 0.2),
line = list(color = 'green', width = 1.5)))
fig <- fig %>% add_trace(y = ~Failure, name = 'Failure', text = paste0(round(100 * Failure, 2), '%'), textposition = 'auto',
marker = list(color=alpha('red', 0.2),
line = list(color = 'red', width = 1.5)))
fig <- fig %>% layout(yaxis = list(title = 'Frequency'))
fig <- fig %>% layout(xaxis = list(title = ""))
fig
Figure 1.1: Barplot of outcome frequencies.
Indeed, the imbalance ratio (IR), defined as the ratio of the size of the majority (‘negative’) class to the size of the minority (‘positive’) class, is equal to
cat(sum(climate$outcome=="success")/sum(climate$outcome=="failure"),',')
10.73913 ,
which is much larger than 1. The relative lack of information on the minority class can lead classification models to misclassify its elements. For this reason, some adjustments will be made throughout the analysis.
Clearly, the variables Study and Run are not useful for predicting the variable outcome. As for the ocean model parameters, Figure 1.2 shows how they vary across the two classes.
library(plotly)
# Box plots of each POP2 parameter by outcome, with a dropdown menu to switch
# among parameters. After the empty base trace, traces come in success/failure
# pairs, so parameter i owns traces 2*i and 2*i + 1.
params <- colnames(climate)[3:20]
fig <- plot_ly(climate, type = 'box')
for (i in seq_along(params)) {
  fig <- fig %>%
    add_trace(y = climate[climate$outcome == "success", params[i]], type = "box",
              fillcolor = alpha('green', 0.2), name = "Success", visible = (i == 1),
              boxpoints = 'outliers', marker = list(color = 'green'),
              line = list(color = 'green')) %>%
    add_trace(y = climate[climate$outcome == "failure", params[i]], type = "box",
              fillcolor = alpha('red', 0.2), name = "Failure", visible = (i == 1),
              boxpoints = 'outliers', marker = list(color = 'red'),
              line = list(color = 'red'))
}
# One dropdown button per parameter: show only its pair of traces and relabel
# the y axis accordingly.
buttons <- lapply(seq_along(params), function(i) {
  vis <- rep(FALSE, 2 * length(params) + 1)  # +1 for the empty base trace
  vis[c(2 * i, 2 * i + 1)] <- TRUE
  list(method = "update",
       args = list(list(visible = vis),
                   list(yaxis = list(title = params[i]))),
       label = params[i])
})
fig <- fig %>% layout(
  title = "",
  xaxis = list(domain = c(0.1, 1)),
  yaxis = list(title = params[1]),
  updatemenus = list(list(y = 0.8, active = 0, buttons = buttons))
)
fig
Figure 1.2: Boxplots of the ocean model parameters.
At first glance, Figure 1.3 shows that:
- failures tend to occur for high values of the predictors vconst_corr, vconst_2, and convect_corr;
- when failures occur, the variable bckgrnd_vdc1 generally takes low values.
library(plotly)
fig1 <- plot_ly(y = ~climate[climate$outcome=="success",]$vconst_corr, type = "box",fillcolor=alpha('green', 0.2),line=list(color="green"),name="Success",boxpoints = 'outliers',
marker = list(color ='green'),
line = list(color = 'green'))
fig1 <- fig1 %>% add_trace(y = ~climate[climate$outcome=="failure",]$vconst_corr,fillcolor=alpha('red', 0.2),line=list(color="red"),name="Failure",boxpoints = 'outliers',
marker = list(color ='red'),
line = list(color = 'red'))
fig2 <- plot_ly(y = ~climate[climate$outcome=="success",]$vconst_2, type = "box",fillcolor=alpha('green', 0.2),line=list(color="green"),name="Success",boxpoints = 'outliers',
marker = list(color ='green'),
line = list(color = 'green'))
fig2 <- fig2 %>% add_trace(y = ~climate[climate$outcome=="failure",]$vconst_2,fillcolor=alpha('red', 0.2),line=list(color="red"),name="Failure",boxpoints = 'outliers',
marker = list(color ='red'),
line = list(color = 'red'))
fig3 <- plot_ly(y = ~climate[climate$outcome=="success",]$convect_corr, type = "box",fillcolor=alpha('green', 0.2),line=list(color="green"),name="Success",boxpoints = 'outliers',
marker = list(color ='green'),
line = list(color = 'green'))
fig3 <- fig3 %>% add_trace(y = ~climate[climate$outcome=="failure",]$convect_corr,fillcolor=alpha('red', 0.2),line=list(color="red"),name="Failure",boxpoints = 'outliers',
marker = list(color ='red'),
line = list(color = 'red'))
fig4 <- plot_ly(y = ~climate[climate$outcome=="success",]$bckgrnd_vdc1, type = "box",fillcolor=alpha('green', 0.2),line=list(color="green"),name="Success",boxpoints = 'outliers',
marker = list(color ='green'),
line = list(color = 'green'))
fig4 <- fig4 %>% add_trace(y = ~climate[climate$outcome=="failure",]$bckgrnd_vdc1,fillcolor=alpha('red', 0.2),line=list(color="red"),name="Failure",boxpoints = 'outliers',
marker = list(color ='red'),
line = list(color = 'red'))
fig<-subplot(fig1,fig2,fig3,fig4,nrows=2)
fig <- fig %>% layout(
yaxis = list(
title = "vconst_corr"
),
xaxis=list(domain=c(0,0.45)),
yaxis2 = list(
title = "vconst_2"
),
xaxis2=list(domain=c(0.55,1)),
yaxis3 = list(
title = "convect_corr"
),
xaxis3=list(domain=c(0,0.45)),
yaxis4 = list(
title = "bckgrnd_vdc1"
),
xaxis4=list(domain=c(0.55,1)),
showlegend= FALSE
)
fig
Figure 1.3: Boxplots of the most influential ocean model parameters.
The boxplots for all the other variables largely overlap, so we cannot conclude that the distributions of these predictors differ significantly between the two classes.
We now draw a pairwise correlation heatmap in order to visualize the Pearson correlation among all pairs of predictors:
mcor=cor(climate[,3:20])
fig <- plot_ly(z = mcor, colors = colorRamp(c("cyan","white","blue")), type = "heatmap",x=colnames(climate[,3:20]),y=colnames(climate[,3:20]))
fig <- fig %>% colorbar(title = "", limits = c(-1,1))
fig
Figure 1.4: Heatmap of the pairwise Pearson correlations between features.
We also computed the Variance Inflation Factor (VIF) in order to assess whether our data set exhibits multicollinearity among some of its predictors. The VIF for each feature can be computed using the following formula:
\[ \operatorname{VIF}(X_j)=\frac{1}{1-R_{X_j|X_{-j}}^2} \]
where \(R_{X_j|X_{-j}}^2\) is the \(R^2\) of a regression of \(X_j\) onto all the other predictors. As a rule of thumb, a \(\operatorname{VIF}\) above \(5\) indicates a high amount of collinearity.
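As a sanity check on the formula, the VIF of a single predictor can be computed directly from this definition. The helper `vif_manual` is illustrative; since it uses linear regressions among the predictors only, its values need not coincide exactly with those reported by car::vif for a fitted logistic model.

```r
# VIF of predictor j: regress it on all remaining predictors and
# apply 1 / (1 - R^2).
vif_manual <- function(df, j) {
  r2 <- summary(lm(df[[j]] ~ ., data = df[, -j, drop = FALSE]))$r.squared
  1 / (1 - r2)
}

vif_manual(climate[, 3:20], 1)  # VIF of vconst_corr
```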
library(car)
regLogAll=glm(outcome~.,data=climate[,3:21],family=binomial)
vif(regLogAll)
vconst_corr vconst_2 vconst_3
3.067922 2.664454 1.343272
vconst_4 vconst_5 vconst_7
1.244214 1.524105 1.259427
ah_corr ah_bolus slm_corr
1.077809 1.158749 1.217025
efficiency_factor tidal_mix_max vertical_decay_scale
1.232095 1.277516 1.281522
convect_corr bckgrnd_vdc1 bckgrnd_vdc_ban
2.013272 2.112592 1.170947
bckgrnd_vdc_eq bckgrnd_vdc_psim Prandtl
1.721396 1.419228 1.110827
The variables in the data set show very low to no correlation. This is not surprising: as highlighted in [1], some parameters were introduced precisely to remove correlations with other parameters. For example, the value drawn for vconst_corr is assigned to both vconst_1 and vconst_6.
In the subsections below, we construct several classification models, namely Logistic Regression, \(k\)-NN, SVM and Random Forest.
The creators of the data set used the observations of the first two studies as the training set and those of the remaining study as the test set. We perform our analysis in the same fashion.
Thus, following the validation-set approach, the models mentioned above are trained on the training set and then compared on the test set.
The outputs below show that the variables maintain approximately the same distribution in the training and test sets.
data.train = climate[1:360, -c(1, 2)]
data.test = climate[361:540, -c(1, 2)]
cat('Number of failures:', sum(data.train$outcome == "failure"), ', number of successes:', sum(data.train$outcome == "success"), 'and their proportion:', sum(data.train$outcome == "failure") / sum(data.train$outcome == "success"), 'in studies 1,2.')
Number of failures: 32 , number of successes: 328 and their proportion: 0.09756098 in studies 1,2.
cat('Number of failures:', sum(data.test$outcome == "failure"), ', number of successes:', sum(data.test$outcome == "success"), 'and their proportion:', sum(data.test$outcome == "failure") / sum(data.test$outcome == "success"), 'in study 3.')
Number of failures: 14 , number of successes: 166 and their proportion: 0.08433735 in study 3.
cat('Differences between the main statistical measures of the training data set and the test set:')
Differences between the main statistical measures of the training data set and the test set:
stat_meas_train=do.call(cbind,lapply(data.train[,1:18],summary))
stat_meas_test=do.call(cbind,lapply(data.test[,1:18],summary))
round(stat_meas_train-stat_meas_test,3)
vconst_corr vconst_2 vconst_3 vconst_4 vconst_5 vconst_7 ah_corr
Min. 0.001 0.001 -0.004 -0.001 -0.003 0.001 0.000
1st Qu. -0.003 -0.001 -0.001 -0.002 -0.002 0.003 -0.001
Median -0.002 -0.001 -0.001 0.001 0.003 -0.002 0.000
Mean 0.000 0.000 0.000 0.000 0.000 0.000 0.000
3rd Qu. -0.001 0.003 -0.001 0.003 -0.003 0.000 0.003
Max. -0.002 0.001 0.004 0.003 0.004 0.001 0.001
ah_bolus slm_corr efficiency_factor tidal_mix_max vertical_decay_scale
Min. 0.001 -0.001 -0.002 0.000 0.000
1st Qu. 0.003 -0.001 0.000 0.001 -0.002
Median 0.002 0.001 0.000 0.000 0.002
Mean 0.000 0.000 0.000 0.000 0.000
3rd Qu. -0.002 0.000 0.000 0.002 0.003
Max. 0.003 0.002 -0.004 -0.005 0.000
convect_corr bckgrnd_vdc1 bckgrnd_vdc_ban bckgrnd_vdc_eq
Min. 0.003 -0.001 -0.004 -0.004
1st Qu. -0.002 0.002 -0.002 -0.002
Median 0.002 0.002 0.000 0.000
Mean 0.000 0.000 0.000 0.000
3rd Qu. 0.001 0.000 -0.001 0.000
Max. 0.002 0.000 -0.001 0.002
bckgrnd_vdc_psim Prandtl
Min. 0.000 -0.004
1st Qu. 0.000 -0.001
Median -0.001 0.002
Mean 0.000 0.000
3rd Qu. -0.001 0.003
Max. 0.003 0.003
As pointed out above, we are dealing with imbalanced data. In our setting, as in most problems with unbalanced data, it is far more important to accurately identify the rarer cases than the more common ones, and this is reflected in the costs associated with prediction errors. In our case, for example, correctly predicting a failure event is much more valuable than correctly predicting a success, since it allows us to prevent a system crash. From this perspective, making a mistake on a minority-class element “costs” more. Unfortunately, in many cases we have no access to a domain expert and no a priori information on these costs.
In particular, ours is a problem with absolute rarity, i.e. a problem that concerns not only the relative proportions of examples belonging to each class, but also the low number of available positive examples, as defined in [2].
While defining performance measures is necessary for comparing any classification models, it is particularly important for problems with class imbalance. In these cases, as noted earlier, the costs of errors are often asymmetric and quite skewed, which violates the default assumption of most classifier induction algorithms, namely that errors have uniform cost and thus that accuracy should be optimized. Using accuracy as an evaluation metric in the presence of class imbalance means that, in most cases, poor minority-class performance is hidden behind good majority-class performance. This makes sense from an optimization standpoint, since overall accuracy is the weighted average of the per-class accuracies, with weights given by the proportion of training examples belonging to each class.
A variety of metrics are routinely used when learning from imbalanced data without accurate information about costs. The most common involves ROC (Receiver Operating Characteristic) analysis and the AUC, the area under the ROC curve. The ROC curve is created by plotting the true-positive rate (TPR) against the false-positive rate (FPR) at various classification thresholds. The true-positive rate is also known as sensitivity or recall, while the false-positive rate equals \(1 − \textbf{specificity}\). ROC analysis can sometimes identify optimal models and discard suboptimal ones independently of the cost context or the class distribution (i.e., when one ROC curve dominates another), although in practice ROC curves tend to intersect, so that there is no single dominant model.
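As an illustration, the ROC curve and AUC for the logistic model fitted above can be obtained with the pROC package. This is only a sketch: pROC is assumed to be installed and is not otherwise used in this report.

```r
library(pROC)

# In-sample predicted probabilities of "success" from the logistic model
probs <- predict(regLogAll, type = "response")
roc_obj <- roc(climate$outcome, probs, levels = c("failure", "success"))
auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # draw the ROC curve
```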
Other common metrics for imbalanced learning are based on precision and recall. Precision is the fraction of examples assigned to the positive class that actually belong to it. For imbalanced learning, recall is typically used to measure the coverage of the minority class. Together, precision and recall make it possible to assess the performance of a classifier on the minority class. Some metrics combine the two, such as the F1-measure, which weights precision and recall equally. One can also generate precision-recall curves, analogous to the ROC curve. Metrics based on precision and recall are more sensitive measures of classification performance under class imbalance, as pointed out at http://dpmartin42.github.io/posts/r/imbalanced-classes-part-2.
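These measures can also be computed directly from the confusion-matrix counts; here is a minimal sketch with illustrative counts (the function name is our own).

```r
# Performance measures derived from the four confusion-matrix counts
imbalance_metrics <- function(TP, FP, FN, TN) {
  sens <- TP / (TP + FN)   # recall / true-positive rate
  prec <- TP / (TP + FP)   # precision
  c(sensitivity = sens,
    specificity = TN / (TN + FP),
    precision   = prec,
    npv         = TN / (TN + FN),
    accuracy    = (TP + TN) / (TP + TN + FP + FN),
    F1          = 2 / (1 / prec + 1 / sens))  # harmonic mean of precision and recall
}

round(imbalance_metrics(TP = 10, FP = 5, FN = 4, TN = 161), 4)
```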
The formulas for evaluating the previously mentioned metrics are shown in figure 2.1.
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col=alpha('blue',1))
text(195, 435, 'Failures', cex=1.2)
rect(250, 430, 340, 370, col=alpha('orange',1))
text(295, 435, 'Successes', cex=1.2)
text(125, 370, 'Actual class', cex=1.3, srt=90, font=2)
text(245, 450, 'Predicted class', cex=1.3, font=2)
rect(150, 305, 240, 365, col=alpha('orange',1))
rect(250, 305, 340, 365, col=alpha('blue',1))
text(140, 400, 'Failures', cex=1.2, srt=90)
text(140, 335, 'Successes', cex=1.2, srt=90)
# add in the cm results
#res <- as.numeric(cm$table)
text(195, 400, "TP\n True Positive", cex=1.6, font=2, col='white')
text(195, 335, "FP\n False Positive\n Type I error", cex=1.6, font=2, col='white')
text(295, 400, "FN\n False Negative\n Type II error", cex=1.6, font=2, col='white')
text(295, 335, "TN\n True Negative", cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(5, 75, "Sensitivity", cex=1.2, font=2)
text(5, 35, expression(frac('TP','TP+FN')), cex=1.2)
text(20, 75, "Specificity", cex=1.2, font=2)
text(20, 35, expression(frac('TN','TN+FP')), cex=1.2)
text(35, 75, "Precision", cex=1.2, font=2)
text(35, 35, expression(frac('TP','TP+FP')), cex=1.2)
text(52, 75, "Negative Predictive\n value", cex=1.2, font=2)
text(52, 35, expression(frac('TN','TN+FN')), cex=1.2)
# add in the accuracy information
text(70, 75, "Accuracy", cex=1.2, font=2)
text(70, 35, expression(frac('TP+TN','TP+TN+FP+FN')), cex=1.2)
text(90, 75, "F1", cex=1.2, font=2)
text(90, 35, expression(frac('2',frac('1','Precision')+frac('1','Sensitivity'))), cex=1.2)
Figure 2.1: Confusion matrix and some measures of performance.
Sampling methods are very popular when dealing with imbalanced data. They are primarily employed to address the problem of relative rarity (as defined in [2]) but do not address absolute rarity: except for some algorithms that generate new examples intelligently, these techniques do not tackle the lack of examples belonging to the rare classes and rare cases. Rather, they mask the underlying problem by artificially balancing the data without solving the basic underlying issue. The most basic sampling methods are random undersampling and random oversampling, which are described below.
random undersampling consists in sampling without replacement instances of the majority class, so that its number of observations equals that of the minority class. Random undersampling can lead to unsatisfactory results, since part of the information is lost when only a subset of the observations is retained;
random oversampling aims to balance the class distribution through the random replication of examples belonging to the minority class. Although this approach avoids losing information, it risks overfitting the data, which would lead to an overestimation of model performance and generalizability.
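Both procedures can be sketched in a few lines of base R; the two-class data frame below is invented for illustration (10 failures against 90 successes).

```r
# Toy two-class data frame: 10 failures (minority) vs 90 successes (majority).
set.seed(1)
df <- data.frame(x = rnorm(100),
                 outcome = factor(c(rep("failure", 10), rep("success", 90))))
minority <- df[df$outcome == "failure", ]
majority <- df[df$outcome == "success", ]

# Random undersampling: draw, without replacement, as many majority rows
# as there are minority rows
under <- rbind(minority, majority[sample(nrow(majority), nrow(minority)), ])

# Random oversampling: replicate, with replacement, minority rows until
# they match the majority count
over <- rbind(majority, minority[sample(nrow(minority), nrow(majority),
                                        replace = TRUE), ])
table(under$outcome)  # failure 10, success 10
table(over$outcome)   # failure 90, success 90
```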
More advanced sampling methods, such as SMOTE, use more sophisticated techniques to remove or add examples.
Citing [2], we would like to point out that “Artificially balancing the training distribution may help with the effects of class imbalance, but does not remove the underlying problem”.
Other ways to deal with imbalanced data consist in using evaluation metrics that properly value the learned knowledge, like F1, and cost-sensitive learning methods, where cost means the penalty associated with an incorrect prediction.
Furthermore, a more serious complication arises from the overlap between simulation successes and failures, as shown in the first column of figure 2.3, which displays some low-dimensional projections of our training data. Another issue is given by the isolated failure events visible in the same scatter plots.
Such phenomena can lead to serious misclassification errors of both types, so balancing precision and recall is very important. In SMOTE, every positive instance in the training set is selected for creating new synthetic instances, regardless of the location of nearby negative instances. In Safe-Level-SMOTE, one of the variations of SMOTE, the number of positive neighbors of each positive instance is counted and defined as its safe-level value. Positive instances with a safe-level value equal to zero are considered noise and are dropped. For the remaining positive instances, the safe-level values are used in the synthetic process to determine the possible location of a synthetic instance. For each positive instance \(x_i\) with nonzero safe level, one of its positive neighbors \(x_{i_j}\) is chosen; a synthetic instance is then created at a random position within the possible range. This range is determined by the safe-level ratio, i.e. the ratio of the safe level of \(x_i\) to the safe level of \(x_{i_j}\), as shown in Table 2.1, taken from [3]. The ranges are constructed so that the synthetic instance lies close to the observation with the highest safe-level value. After every positive instance has been used, the synthetic instances are added to the imbalanced data set, and the process is repeated until the data set is nearly balanced.
| Safe level of \(x_i\) | Safe level of \(x_{i_j}\) | Safe-level ratio | The possible range |
|---|---|---|---|
| \(\neq 0\) | \(= 0\) | \(\infty\) | \([x_i,x_i]\) |
| \(\neq 0\) | \(\neq 0\) | \(=1\) | \([x_i,x_{i_j}]\) |
| \(\neq 0\) | \(\neq 0\) | \(>1\) | \([x_i, x_i+\frac{x_{i_j}-x_i}{\text{Safe-level ratio}}]\) |
| \(\neq 0\) | \(\neq 0\) | \(<1\) | \([x_{i_j}-(\text{Safe-level ratio})(x_{i_j}-x_i),x_{i_j}]\) |
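The range logic of Table 2.1 can be sketched in code for the one-dimensional case. The function name `sls_synthetic` is ours and this is only an illustration of the ranges, not the smotefamily implementation.

```r
# Illustrative 1-D version of the ranges in Table 2.1 (our own sketch).
sls_synthetic <- function(x_i, x_ij, sl_i, sl_ij) {
  if (sl_i == 0) return(NULL)   # x_i is noise: no synthetic instance
  if (sl_ij == 0) return(x_i)   # ratio = Inf: duplicate x_i
  ratio <- sl_i / sl_ij         # safe-level ratio
  gap <- if (ratio >= 1) runif(1, 0, 1 / ratio)  # [x_i, x_i + (x_ij - x_i)/ratio]
         else            runif(1, 1 - ratio, 1)  # [x_ij - ratio*(x_ij - x_i), x_ij]
  x_i + gap * (x_ij - x_i)      # point lands closer to the safer instance
}
set.seed(42)
sls_synthetic(x_i = 0, x_ij = 1, sl_i = 4, sl_ij = 2)  # lands in [0, 0.5]
```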
Safe-Level-SMOTE only indirectly acknowledges the negative instances located around a positive instance: synthetic instances may still be generated too close to some negative instances. To prevent this, after randomly placing the synthetic instance within the range defined above, the distances between the instance and its surrounding negatives (the negative neighbors of \(x_i\) and \(x_{i_j}\)) are taken into consideration. If the distance from its closest negative neighbor is less than the distance from its closest range extreme, the synthetic instance is relocated closer to that extreme. An example of this procedure, called Relocating Safe-Level SMOTE (RSLS), is shown in figure 2.2, also taken from [3], while the application of this technique to the training data set is shown in the third column of figure 2.3.
Figure 2.2: Example of relocating a synthetic instance.
Generally, an isolated data point surrounded by instances of a different class and placed far away from elements of its own class is considered noise, and some classification methods ignore noise instances during the classification process. Although noise instances may belong to either class, only those coming from the positive class are considered minority outcast instances. Since every positive instance is rare and essential in a class imbalance problem, removing these minority outcast instances decreases the positive recognition rate. SMOTE does not deal with minority outcasts and generates synthetic instances from every positive instance, while RSLS drops these minority outcast instances from the synthetic process. The RSLS training set without outcasts is shown in the last column of figure 2.3. In order to achieve better recall and precision, the authors of [3] still propose excluding these minority outcast instances from the synthetic process (and then training the classifier on the RSLS training set without outcasts), but reusing them later, together with the negative instances, to train an auxiliary \(1\)-NN model. If this \(1\)-NN model classifies an unknown test instance as positive, then the instance is classified as positive by the whole classification process. This procedure is called RSLS handling outcasts (RSLSho).
Among the SMOTE variants designed to deal with these specific problems contained in the smotefamily R package, whose documentation is available at https://cran.r-project.org/web/packages/smotefamily/smotefamily.pdf, we chose RSLSho after comparing its results with those of the other variants. This selection was done by looking at figures like 2.3. Finally, the best value of \(k\) employed in the SMOTE and RSLS procedures was assessed by considering both the modified data sets returned by these techniques (again looking at figure 2.3) and the performance of classifiers trained on them.
library(smotefamily)
library(dplyr)
x=c("outcome")
numClass=ifelse(data.train$outcome=="success",1,0)
# RSLS: rebalanced data, plus the outcast instances it drops
SMOTEmodel <- smotefamily::RSLS(data.train[,-19],data.train[,19],K=5,C=5,dupSize = 0)
outcast=SMOTEmodel$outcast
colnames(outcast)[19]="outcome"
RSLStrain=SMOTEmodel$data
colnames(RSLStrain)[19]="outcome"
levels(RSLStrain$outcome)=c("failure","success")
# RSLS training set without the outcast instances
RSLSWOtrain=anti_join(RSLStrain,outcast)
RSLSWOtrain$outcome=as.factor(RSLSWOtrain$outcome)
levels(RSLSWOtrain$outcome)=c("failure","success")
# original negative instances plus outcasts, used later by the auxiliary 1-NN
negOutTrain=union(SMOTEmodel$orig_N,SMOTEmodel$outcast)
colnames(negOutTrain)[19]="outcome"
negOutTrain$outcome=as.factor(negOutTrain$outcome)
# plain SMOTE training set, for comparison
SMOTEtrain<-smotefamily::SMOTE(data.train[,-19],data.train[,19],K=5,dup_size = 0)
SMOTEtrain=SMOTEtrain$data
colnames(SMOTEtrain)[19]="outcome"
SMOTEtrain$outcome=as.factor(SMOTEtrain$outcome)
levels(SMOTEtrain$outcome)=c("failure","success")
Success=sum(data.train$outcome=="success")/(nrow(data.train))
Failure=sum(data.train$outcome=="failure")/(nrow(data.train))
dataplot1 <- data.frame(x, Success, Failure)
Success=sum(SMOTEtrain$outcome=="success")/(nrow(SMOTEtrain))
Failure=sum(SMOTEtrain$outcome=="failure")/(nrow(SMOTEtrain))
dataplot2 <- data.frame(x, Success, Failure)
Success=sum(RSLSWOtrain$outcome=="success")/(nrow(RSLSWOtrain))
Failure=sum(RSLSWOtrain$outcome=="failure")/(nrow(RSLSWOtrain))
dataplot3 <- data.frame(x, Success, Failure)
library(plotly) # plot_ly and subplot are used below
colo=as.factor(ifelse(data.train$outcome=="success","green","red"))
cols3=as.factor(ifelse(SMOTEtrain$outcome=="success","green","red"))
cols2=as.factor(ifelse(RSLStrain$outcome=="success","green","red"))
cols=as.factor(ifelse(RSLSWOtrain$outcome=="success","green","red"))
fig1 <- plot_ly(type='scatter', mode="markers", x=~data.train$vconst_corr, y=~data.train$vconst_2, marker=list( color=colo,sizeref=0.1, opacity=0.5),name="original")
fig2 <- plot_ly(type='scatter', mode="markers", x=~data.train$vconst_corr, y=~data.train$convect_corr, marker=list( color=colo,sizeref=0.1, opacity=0.5),name="original")
fig3 <-plot_ly(type='scatter', mode="markers", x=~data.train$vconst_corr, y=~data.train$bckgrnd_vdc1, marker=list( color=colo,sizeref=0.1, opacity=0.5),name="original")
fig4 <- plot_ly(type='scatter', mode="markers",x=~SMOTEtrain$vconst_corr, y=~SMOTEtrain$vconst_2, marker=list( color=cols3, sizeref=0.1, opacity=0.5),name="SMOTE")
fig5 <-plot_ly(type='scatter', mode="markers",x=~SMOTEtrain$vconst_corr, y=~SMOTEtrain$convect_corr, marker=list( color=cols3, sizeref=0.1, opacity=0.5),name="SMOTE")
fig6 <- plot_ly(type='scatter', mode="markers",x=~SMOTEtrain$vconst_corr, y=~SMOTEtrain$bckgrnd_vdc1, marker=list( color=cols3, sizeref=0.1, opacity=0.5),name="SMOTE")
fig7 <- plot_ly(type='scatter', mode="markers",x=~RSLStrain$vconst_corr, y=~RSLStrain$vconst_2, marker=list( color=cols2, sizeref=0.1, opacity=0.5),name="RSLS")
fig8 <-plot_ly(type='scatter', mode="markers",x=~RSLStrain$vconst_corr, y=~RSLStrain$convect_corr, marker=list( color=cols2, sizeref=0.1, opacity=0.5),name="RSLS")
fig9 <- plot_ly(type='scatter', mode="markers",x=~RSLStrain$vconst_corr, y=~RSLStrain$bckgrnd_vdc1, marker=list( color=cols2, sizeref=0.1, opacity=0.5),name="RSLS")
fig10 <- plot_ly(type='scatter', mode="markers",x=~RSLSWOtrain$vconst_corr, y=~RSLSWOtrain$vconst_2, marker=list( color=cols, sizeref=0.1, opacity=0.5),name="RSLS w/o outcasts")
fig11 <-plot_ly(type='scatter', mode="markers",x=~RSLSWOtrain$vconst_corr, y=~RSLSWOtrain$convect_corr, marker=list( color=cols, sizeref=0.1, opacity=0.5),name="RSLS w/o outcasts")
fig12 <- plot_ly(type='scatter', mode="markers",x=~RSLSWOtrain$vconst_corr, y=~RSLSWOtrain$bckgrnd_vdc1, marker=list( color=cols, sizeref=0.1, opacity=0.5),name="RSLS w/o outcasts")
s1 <- subplot(fig1,fig4,fig7,fig10,fig2,fig5,fig8,fig11,fig3,fig6,fig9,fig12,nrows=3,shareX=T,shareY=T)
s1 <- s1 %>% layout(
yaxis = list(
title = "vconst_2"
),
xaxis=list(title="vconst_corr (original)"),
yaxis2 = list(
title = "convect_corr"
),
xaxis2=list(title="vconst_corr (SMOTE)"),
yaxis3 = list(
title = "bckgrnd_vdc1"
),
xaxis3=list(title="vconst_corr (RSLS)"),
xaxis4=list(title="vconst_corr \n(RSLS w/o outcasts)"),
showlegend= FALSE
)
s1
Figure 2.3: Some scatter plots of the training data set before and after applying resampling techniques.
The barplots in figure 2.4 show the proportions of the positive class in the original training set, in the training set after applying SMOTE, and in the training set after applying RSLS and removing outcasts.
Figure 2.4: Barplots of outcome frequencies in original, SMOTE and RSLS w/o outcasts training set.
In the following, we will use the classifiers and some of the techniques mentioned above for dealing with imbalanced data.
Logistic regression exploits the logistic function in order to model the conditional probability of the two classes given the predictors:
\[ p(X)=\mathbb{P}(Y=1|X)= \frac{ \exp( \beta_0+ ...+ \beta_dX_d)}{1+\exp( \beta_0+ ...+ \beta_dX_d)}. \] The main advantages that we get by choosing to use the logistic function to estimate \(\mathbb{P}(Y=1|X)\) are that:
its output always ranges between \(0\) and \(1\);
it is very easy to work with, since it is continuous and monotonic;
we can easily reshape it by changing the regression coefficients.
Notice that the log-odds linearly depend on the predictors: \[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_d X_d. \] In our case, \(p(x)\) is the probability for the outcome variable to take value “success”, so it is the probability for a simulation to not fail. This means that:
the probability of a crash in the simulation gets higher as predictors with negative regression coefficients increase;
the probability of a crash in the simulation decreases as predictors with positive regression coefficients increase.
Maximum likelihood estimation is performed in order to estimate both the class-membership probabilities of each observation and the parameters involved in the model. In this case the likelihood function is given by:
\[ \mathcal{L}\left(\beta_{0},\ldots, \beta_{d} |y\right)=\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i} \]
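As a sanity check of this formula, the log of the likelihood can be evaluated by hand at the fitted coefficients and compared with the value reported by glm(); the data below are simulated toy values, not our climate data.

```r
# Simulated toy data: check the hand-written log-likelihood against glm().
set.seed(7)
x <- rnorm(50)
y <- rbinom(50, 1, plogis(1 + 2 * x))
fit <- glm(y ~ x, family = binomial)

loglik_manual <- function(beta, x, y) {
  p <- plogis(beta[1] + beta[2] * x)       # p(x) from the formula above
  sum(y * log(p) + (1 - y) * log(1 - p))   # log of the likelihood above
}
all.equal(loglik_manual(coef(fit), x, y), as.numeric(logLik(fit)))  # TRUE
```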
Now, we find the best possible logistic regression model using the step function of the R library stats, which chooses the model with the lowest AIC value. Notice that we decided not to take interaction terms into consideration in our model. This is due to the fact that all the predictors are mutually uncorrelated, hence any interaction would be very difficult to interpret.
So, we first apply step to the original training set.
regLogAll=glm(outcome~.,data=data.train,family=binomial)
regLogStepOriginal=step(regLogAll,trace=0)
We are now going to check whether the data set contains high-leverage points.
library(car)
meas.inf = influence.measures(regLogStepOriginal)$infmat
index=1:360
s <- summary(influence.measures(regLogStepOriginal),print=F)[,"cook.d"]
Figure 2.5: Cook’s distance of observations in the original training set.
We can see in figure 2.5 that observation No. 296 exerts a very high influence on our regression model. High-leverage points show unusual values in the predictors. Including such observations in our training set is crucial, since it enables our model to detect these uncommon points and classify them correctly. We remark that all the points identified as outcasts by RSLS are high-leverage points.
library(arsenal)
comparedf(data.train[as.numeric(names(s)),],outcast)
Compare Object
Function Call:
comparedf(x = data.train[as.numeric(names(s)), ], y = outcast)
Shared: 19 non-by variables and 8 observations.
Not shared: 0 variables and 47 observations.
Differences found in 18/18 variables compared.
1 variables compared have non-identical attributes.
For these reasons, a binomial GLM trained on RSLS without outcasts cannot lead to good results. Furthermore, if we apply the step function to the SMOTE training data, as shown in figure 2.6, the number of high-leverage points increases and many of them are synthetic data points. As a result, our model would rely heavily on artificially generated points. We therefore prefer not to apply logistic regression to data on which we performed resampling techniques.
regLogAll=glm(outcome~.,data=SMOTEtrain,family=binomial)
regLogStepSMOTE=step(regLogAll,trace=0)
library(car)
meas.inf = influence.measures(regLogStepSMOTE)$infmat
index=1:nrow(SMOTEtrain)
s <- summary(influence.measures(regLogStepSMOTE),print=F)[,"cook.d"]
Figure 2.6: Cook’s distance of observations in the SMOTE training set.
As reported in the following summary, the model trained on the original data set partially confirms the qualitative analysis we previously carried out by looking at the histograms and boxplots reported in the introduction (1.3):
summary(regLogStepOriginal)
Call:
glm(formula = outcome ~ vconst_corr + vconst_2 + vconst_4 + vconst_5 +
vconst_7 + vertical_decay_scale + convect_corr + bckgrnd_vdc1 +
bckgrnd_vdc_eq + bckgrnd_vdc_psim, family = binomial, data = data.train)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.8141 0.0018 0.0175 0.1180 1.2011
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 13.329 2.823 4.721 2.35e-06 ***
vconst_corr -11.566 2.534 -4.565 5.00e-06 ***
vconst_2 -10.971 2.343 -4.682 2.84e-06 ***
vconst_4 1.975 1.219 1.620 0.105248
vconst_5 3.633 1.425 2.550 0.010762 *
vconst_7 2.323 1.174 1.980 0.047757 *
vertical_decay_scale -2.910 1.285 -2.264 0.023581 *
convect_corr -5.093 1.499 -3.397 0.000682 ***
bckgrnd_vdc1 4.719 1.575 2.995 0.002740 **
bckgrnd_vdc_eq 3.621 1.293 2.801 0.005096 **
bckgrnd_vdc_psim 2.780 1.278 2.175 0.029651 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 215.971 on 359 degrees of freedom
Residual deviance: 69.912 on 349 degrees of freedom
AIC: 91.912
Number of Fisher Scoring iterations: 9
Specifically, it confirms that failures are associated with relatively high values of the predictors vconst_corr and vconst_2.
Now, classification with logistic regression is performed by fixing a probability threshold \(s\) and then applying the following rule:
\[ \hat{C}(x) = \begin{cases} 1, & \hat{p}(x) > s \\ 0, & \hat{p}(x) \leq s \end{cases} \]
In order to find the best trade-off between precision and recall, we select via cross-validation the classification threshold that returns the most satisfactory results in terms of the F1-measure. The final estimate of a generic measure obtained through this process has the form:
\[ \text{Measure}_{CV(K)}=\sum_{k=1}^K \frac{n_k}{n}\text{Measure}_k \]
where \(K\) is the number of folds, \(n\) the number of records, \(n_k\) the number of records in the \(k\)-th fold and \(\text{Measure}_k\) the value taken by the measure on that fold.
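In code this weighted average is a one-liner; note that with equal-sized folds (as in stratified \(K\)-fold CV on a balanced partition) it reduces to the plain mean of the per-fold measures, which is what the loops below compute. The fold values here are invented for illustration.

```r
# Weighted average of per-fold measures, as in the formula above.
cv_measure <- function(measure_k, n_k) sum(n_k / sum(n_k) * measure_k)
cv_measure(measure_k = c(0.80, 0.90, 1.00), n_k = c(18, 18, 18))  # 0.9
```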
Since the number of observations in our data set is rather low, we need to apply cross-validation with many folds in order to obtain reliable estimates. For this reason, we use \(20\)-fold cross-validation, repeated \(10\) times (in order to decrease the dependence of the data partition on the initial seed). At each iteration we fit a logistic model using the step function.
Notice that, from now on, all the ranges from which we extract the optimal hyperparameter values are the result of a repeated refinement process.
During our experiments we observed that sometimes we were not able to compute the above-mentioned metrics. Specifically, this happened when the true positives were zero together with either the false positives or the false negatives (making precision or recall undefined). To overcome this problem we decided to:
assign value 1 to precision, recall and F1-measure when true positives, false positives and false negatives are all 0;
assign value 0 to precision, recall and F1-measure if true positives are 0 and one of the two other counters is larger than 0.
Regarding the first case, we highlight that we perform stratified sampling; nevertheless, in rare cases the validation fold does not contain any positive observation. This problem could also be tackled by decreasing the number of folds, but doing so would leave us with too small a training set.
We applied the same corrections to all the other algorithms we developed.
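The two correction rules above can be collected in a small helper; this is a sketch of the logic that our cross-validation loops inline, with tp, fp and fn standing for the confusion-matrix counts.

```r
# The zero-count correction rules from the text, as a standalone helper.
safe_prf <- function(tp, fp, fn) {
  if (tp == 0 && fp == 0 && fn == 0) return(c(precision = 1, recall = 1, F1 = 1))
  if (tp == 0) return(c(precision = 0, recall = 0, F1 = 0))
  p <- tp / (tp + fp); r <- tp / (tp + fn)
  c(precision = p, recall = r, F1 = 2 * p * r / (p + r))
}
safe_prf(0, 0, 0)  # all 1: no positives in the fold and none predicted
safe_prf(0, 3, 0)  # all 0: positives predicted but none correct
```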
library(splitTools)
library(caret)
repeats=10
nfolds=20
threshold=seq(0.05,0.95,by=0.05)
meas.matrix <- array(0,dim=c(5,length(threshold)),dimnames=list(c("Accuracy","Precision","Sensitivity","Specificity","F1"), threshold))
mm <- array(0,dim=c(5,length(threshold),repeats*nfolds))
# We start by making repeated, stratified cross-validation folds
set.seed(3)
folds <- create_folds(data.train$outcome, k = nfolds, m_rep = repeats)
count=1
for(fold in folds){
reg_LogAll=glm(outcome~.,data=data.train[fold,],family=binomial)
reg_LogStep=step(reg_LogAll,trace=0)
prob=predict(reg_LogStep,newdata=data.train[-fold,],type="response")
ss=0
for(s in threshold){
ss=ss+1
newpred = as.factor(ifelse(prob> s, "success", "failure"))
cm=confusionMatrix(newpred,data.train[-fold,]$outcome,positive="failure")
mm[1,ss,count]=cm$overall[1]; # Accuracy and Specificity are never NA or NaN
mm[4,ss,count]=cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
mm[2,ss,count]=1;
mm[3,ss,count]=1;
mm[5,ss,count]=1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
mm[2,ss,count]=0;
mm[3,ss,count]=0;
mm[5,ss,count]=0;
}
else{
mm[2,ss,count]=cm$byClass[5];
mm[3,ss,count]=cm$byClass[1];
mm[5,ss,count]=cm$byClass[7];
}
}
count=count+1
}
for(i in 1:5){
for(ss in 1:length(threshold)){
meas.matrix[i,ss]=mean(mm[i,ss,])
}
}
The figure below shows how the performance measures vary as functions of the probability threshold.
Figure 2.7: Performance measures as functions of the probability threshold.
## choose the threshold that maximizes the cross-validated F1
s.glm=which.max(meas.matrix["F1",])
cat('Best threshold:',threshold[s.glm],'for glm.')
Best threshold: 0.65 for glm.
We now set the classification threshold to the optimal value computed above and use the model trained on the entire training set to perform classification on the test set.
probs=predict(regLogStepOriginal,newdata=data.test,type="response")
pred.glm=as.factor(ifelse(probs>threshold[s.glm],"success","failure"))
cm=confusionMatrix(pred.glm,as.factor(data.test$outcome))
draw_confusion_matrix(cm)
Figure 2.8: Confusion matrix of the logistic regression model.
final.meas=matrix(0,13,3)
final.meas[1,1]=cm$byClass[[7]]
final.meas[1,2]=cm$byClass[[1]]
final.meas[1,3]=cm$byClass[[5]]
The \(k\)-nearest neighbors algorithm (\(k\)-NN) is a non-parametric classification method which estimates the conditional distribution of \(Y\) given \(X\) and then assigns a given observation to the class with the highest estimated probability.
Given \(k\in \mathbb{N}^+\) and a test observation \(x_0\), the \(k\)-NN classifier first identifies the set of the \(k\) points in the training data that are closest to \(x_0\), i.e. \[ \mathcal{N}_k(x_0)=\{x_{i_1},\dots,x_{i_k} \,|\, d(x_0,x_h)\geq d(x_0,x_{i_j})\ \forall j=1,\dots,k\ \forall h=1,\dots,n: h\neq i_1,\dots,i_k\}, \] then it estimates the conditional probability of class \(j\) as the fraction of points in \(\mathcal{N}_k(x_0)\) whose response equals \(j\): \[ P(Y=j|X=x_0)=\frac{1}{k}\sum_{i:x_i \in \mathcal{N}_k(x_0)}{I(y_i=j)}, \] where \(I\) is the indicator function. Finally, classification is performed by assigning \(x_0\) to the most probable class (Bayes rule).
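A minimal base-R sketch of this probability estimate (Euclidean distance; the training points are toy values invented for the example). The knn3Train() function used later returns this same kind of class-probability estimate.

```r
# Sketch of the k-NN class-probability estimate with Euclidean distance.
knn_prob <- function(train_x, train_y, x0, k) {
  d <- sqrt(rowSums((train_x - matrix(x0, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))
  neighbours <- train_y[order(d)[1:k]]
  table(neighbours) / k   # fraction of each class among the k nearest points
}
train_x <- matrix(c(0, 0, 1, 1, 0, 1, 0, 1), ncol = 2)  # 4 toy points
train_y <- factor(c("failure", "failure", "success", "success"))
knn_prob(train_x, train_y, x0 = c(0.1, 0.1), k = 3)  # failure 2/3, success 1/3
```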
The \(k\)-NN algorithm is based on distances between points (the Euclidean distance in our case), so it is very important to normalize the data before applying \(k\)-NN. For example, given two features \(x\in [0,1]\) and \(y\in[10000,20000]\) and three points \(A=(x=0.5,y=12000)\), \(B=(0.1,12000)\) and \(D=(0.5,12500)\), with \(k=1\) point \(A\) would be classified like \(B\), even though \(D\) is actually the closest point once the feature scales are comparable. Normalizing removes this problem. In our case, the data are already normalized.
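The toy example above, in code: on the raw scales the \(y\) feature dominates the distance, and rescaling reverses which neighbor is nearest.

```r
# Toy example from the text: raw vs normalized Euclidean distances.
A <- c(x = 0.5, y = 12000); B <- c(x = 0.1, y = 12000); D <- c(x = 0.5, y = 12500)
dist_raw <- c(AB = sqrt(sum((A - B)^2)), AD = sqrt(sum((A - D)^2)))
# Rescale both features to [0, 1] (x already is; y spans [10000, 20000])
norm_y <- function(p) c(p["x"], (p["y"] - 10000) / 10000)
dist_scaled <- c(AB = sqrt(sum((norm_y(A) - norm_y(B))^2)),
                 AD = sqrt(sum((norm_y(A) - norm_y(D))^2)))
dist_raw     # AB = 0.4, AD = 500: B looks closer on the raw scales
dist_scaled  # AB = 0.4, AD = 0.05: D is closer after normalization
```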
The \(k\)-NN method suffers from the curse of dimensionality. For example, consider the unit cube \([0,1]^d\) and suppose the training data are uniformly distributed in this hypercube. Let \(\ell\) be the edge length of the smallest hypercube that contains all \(k\) nearest neighbors of a test point. Then \(\ell \approx \left(\frac{k}{n} \right)^{\frac{1}{d}}\) (https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote02_kNN.html). The larger \(d\), the closer \(\ell\) gets to \(1\), so the \(k\) nearest neighbors are not so near after all.
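Evaluating the approximation for, say, \(k=10\) neighbors among \(n=1000\) points shows how quickly the neighborhood edge approaches the full range of each feature as \(d\) grows:

```r
# Edge length of the neighborhood hypercube for k = 10, n = 1000, growing d.
k_nbrs <- 10; n_pts <- 1000
d <- c(2, 5, 10, 20, 100)
round((k_nbrs / n_pts)^(1 / d), 3)  # 0.100 0.398 0.631 0.794 0.955
```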
In order to reduce dimensionality, we can apply PCA (Principal Component Analysis), an unsupervised approach that allows us to summarize the set of features with a smaller number of representative variables that collectively explain most of the variability in the original data set. It is also a tool for data visualization. The PCA problem can be formulated as follows. The first principal component of a set of features \(X_1,X_2,\dots,X_d\) is the linear combination of the features
\[ Z_1 = \alpha_{11}X_1 +\alpha_{12}X_2 + \dots+ \alpha_{1d}X_d\ \ \ \text{s.t.} \ \ \sum_{j=1}^d \alpha_{1j}^2=1 \]
that has the largest variance. The elements \(\alpha_{11},\cdots,\alpha_{1d}\in \mathbb{R}\) are called loading factors of the first principal component and they must be chosen so that they correspond to the solution of the following problem:
\[ \max_{\substack{\alpha_1\in\mathbb{R}^d \\ \sum_{j=1}^d\alpha_{1j}^2=1}} \operatorname{Var}\left(\sum_{j=1}^d \alpha_{1j}X_j\right), \] i.e. we want to preserve the direction that maximizes the variance. Suppose we have an \(n\times d\) data matrix \(X=\begin{bmatrix}x_1^T\\ \vdots \\ x_n^T \end{bmatrix}\in\mathbb{R}^{n\times d}\), whose rows are realizations of \((X_1,\dots,X_d)\), and assume the data are centered. We can compute the loading factors of the first principal component by solving \[ \max_{\substack{\alpha_{11},\dots,\alpha_{1d} \\ \sum_{j=1}^d \alpha_{1j}^2=1}}\frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^d\alpha_{1j}x_{ij}\right)^2 \]
This problem can be solved via singular-value decomposition of the matrix \(X\), a standard technique in linear algebra. Then, we can find the second principal component as
\[ Z_2 = \alpha_{21}X_1 +\alpha_{22}X_2 + \dots+ \alpha_{2d}X_d\ \ \ \text{s.t.} \ \ \sum_{j=1}^d \alpha_{2j}^2=1 \] in such a way that \(Z_1\) and \(Z_2\) are uncorrelated. The process continues until the cumulative PVE (Proportion of Variance Explained) is high enough. To decide how much is “high enough”, one can look at the scree plot (PVE versus number of principal components) and search for an elbow.
Figure 2.9 shows that each principal component explains about 5-6% of the total variability and there is no elbow in the scree plot. This does not surprise us, because all the features in our data set are uncorrelated, as pointed out in the introduction.
dataPCA=climate[,3:20]
mx=apply(climate[,3:20],MARGIN=2,mean)
PCA = prcomp(dataPCA, scale. = F,center =mx)
library(plotly)
a <- list(
autotick = FALSE,
ticks = "outside",
tick0 = 0,
dtick = 0.25,
ticklen = 5,
tickwidth = 2,
tickcolor = toRGB("blue")
)
s <- seq(0,1,by=0.1)
fig1 <- plot_ly(y = ~s,type = 'scatter',mode="none")
fig1 <- fig1 %>% add_trace(x=~1:18,y = ~summary(PCA)$importance["Proportion of Variance",], type = 'scatter', mode = 'lines+markers',name="PVE")
fig2 <- plot_ly(x=1:18,y = ~summary(PCA)$importance["Cumulative Proportion",], type = 'scatter', mode = 'lines+markers',name="Cumulative PVE")
fig=subplot(fig1,fig2)
fig <- fig %>% layout(
yaxis = list(
title = "Proportion of Variance explained"
),
xaxis=list(domain=c(0,0.45),title="Principal component"),
yaxis2 = list(
title = "Cumulative Proportion of Variance explained"
),
xaxis2=list(domain=c(0.55,1),title="Principal component"),
showlegend= FALSE
)
fig
Figure 2.9: Scree plot and Cumulative PVE.
However, we decided to apply \(k\)-NN on the subset of variables previously identified through boxplots (1.3) (namely vconst_corr, vconst_2, convect_corr and bckgrnd_vdc1), in order to reduce dimensionality.
In the following, we first apply \(k\)-NN to the reduced original data set, choosing the best \(k\) through 20-fold CV repeated 10 times. Secondly, we apply \(k\)-NN to the reduced SMOTE data set, again selecting the best \(k\) via 20-fold CV repeated 10 times on the original data set, but this time applying SMOTE to the training folds. The same is done for RSLS without outcasts. In the latter case, as mentioned before, we combine the results of the \(k\)-NN model trained on RSLS without outcasts with those of the \(1\)-NN model applied to the data set consisting of the outcasts and the original negative instances.
Furthermore, in order to account for the class imbalance, we modified the algorithm to choose the best probability threshold, i.e. the one that maximizes the F1-measure. Exceptions in precision, recall and F1-measure are handled as explained above.
The best values of \(k\) and of the threshold in the three cases mentioned above are:
data.train1=data.train[,c(1,2,13,14,19)]
data.test1=data.test[,c(1,2,13,14,19)]
repeats=10
nfolds=20
threshold=seq(0.1,0.9,by=0.1)
k_test=seq(1,51,by=2)
meas.matrix<-array(0,dim=c(5,length(threshold),length(k_test)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),threshold,k_test))
mm=matrix(0,5,nfolds*repeats)
# We start by making repeated, stratified cross-validation folds
ss=0
set.seed(3)
folds <- create_folds(data.train1$outcome, k = nfolds, m_rep = repeats)
for(s in threshold){
ss=ss+1
kk=0
for(k in k_test){
kk=kk+1
count=1
for(fold in folds){
# SMOTE sampling
# SMOTE_train <- SMOTE(outcome ~ ., data = data.train1[fold,])
# pred=knn(SMOTE_train[,1:4], data.train1[-fold,1:4],SMOTE_train[fold,5], k = k)
pred=knn3Train(train=data.train1[fold,1:4], test=data.train1[-fold,1:4],
cl=data.train1[fold,5], k = k,prob=TRUE)
prob=attr(pred,"prob")
newpred = as.factor(ifelse(prob[,"failure"] > s, "failure", "success"))
levels(newpred) <- c("failure","success")
cm=confusionMatrix(newpred,data.train1[-fold,]$outcome,positive="failure")
mm[1,count]=cm$overall[1]; # Accuracy and Specificity are never NA or NaN
mm[4,count]=cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
mm[2,count]=1;
mm[3,count]=1;
mm[5,count]=1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
mm[2,count]=0;
mm[3,count]=0;
mm[5,count]=0;
}
else{
mm[2,count]=cm$byClass[5];
mm[3,count]=cm$byClass[1];
mm[5,count]=cm$byClass[7];
}
count=count+1
}
meas.matrix[,ss,kk]=apply(mm,MARGIN=1,mean)
}
}
meas.matrix
## choose best k and threshold
k.original=NA
s.original=NA
f1max = 0
for(s in 1:length(threshold)){
for(k in 1:length(k_test)){
f1 <- meas.matrix["F1",s,k]
if(f1max <f1){
f1max = f1
k.original=k
s.original=s
}
}
}
k.original
s.original
# train model
k.original=20 #k=39
s.original=2 #s=0.2
threshold=seq(0.1,0.9,by=0.1)
k_test=seq(1,51,by=2)
cat('Best k:',k_test[k.original],', best threshold:',threshold[s.original],'for original.')
Best k: 39 , best threshold: 0.2 for original.
meas.matrix.SMOTE <- array(0,dim=c(5,length(threshold),length(k_test)),dimnames=list(c("Accuracy","Precision","Sensitivity","Specificity","F1"), threshold,k_test))
mm=matrix(0,5,nfolds*repeats)
# We start by making repeated, stratified cross-validation folds
ss=0
set.seed(3)
folds <- create_folds(data.train1$outcome, k = nfolds, m_rep = repeats)
for(s in threshold){
ss=ss+1
kk=0
for(k in k_test){
kk=kk+1
count=1
for(fold in folds){
# SMOTE sampling
SMOTE_train<-smotefamily::SMOTE(data.train1[fold,1:4],data.train1[fold,5],K=5,dup_size = 0)
SMOTE_train=SMOTE_train$data
pred=knn3Train(train=SMOTE_train[,1:4], test=data.train1[-fold,1:4],
cl=SMOTE_train[,5],k=k,prob=TRUE)
#pred=knn3Train(train=data.train1[fold,1:4], test=data.train1[-fold,1:4],cl=data.train1[fold,5], k = k,prob=TRUE)
prob=attr(pred,"prob")
newpred = as.factor(ifelse(prob[,"failure"] > s, "failure", "success"))
levels(newpred) <- c("failure","success")
cm=confusionMatrix(newpred,data.train1[-fold,]$outcome,positive="failure")
mm[1,count]=cm$overall[1]; # Accuracy and Specificity are never NA or NaN
mm[4,count]=cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
mm[2,count]=1;
mm[3,count]=1;
mm[5,count]=1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
mm[2,count]=0;
mm[3,count]=0;
mm[5,count]=0;
}
else{
mm[2,count]=cm$byClass[5];
mm[3,count]=cm$byClass[1];
mm[5,count]=cm$byClass[7];
}
count=count+1
}
meas.matrix.SMOTE[,ss,kk]=apply(mm,MARGIN=1,mean)
}
}
meas.matrix.SMOTE
## choose best k and threshold
k.SMOTE=NA
s.SMOTE=NA
f1max = 0
for(s in 1:length(threshold)){
for(k in 1:length(k_test)){
f1 <- meas.matrix.SMOTE["F1",s,k]
if(f1max <f1){
f1max = f1
k.SMOTE=k
s.SMOTE=s
}
}
}
k.SMOTE
s.SMOTE
# train model: hyperparameters fixed from the cross-validation above
k.SMOTE=14 # k_test[14] = 27
s.SMOTE=8  # threshold[8] = 0.8
cat('Best k:',k_test[k.SMOTE],', best threshold:',threshold[s.SMOTE],'for SMOTE.')
Best k: 27 , best threshold: 0.8 for SMOTE.
repeats=10
nfolds=20
threshold=seq(0.1,0.9,by=0.1)
k_test=seq(1,51,by=2)
data.train2=anti_join(data.train,outcast)
meas.matrix<-array(0,dim=c(5,length(threshold),length(k_test)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),threshold,k_test))
set.seed(3)
folds <- create_folds(data.train2$outcome, k = nfolds, m_rep = repeats)
for(fold in folds){
SMOTE_model<-smotefamily::RSLS(data.train2[fold,c(1,2,13,14)],data.train2[fold,19],K=5,C=5,dupSize=0)
RSLSWO_train=SMOTE_model$data
colnames(RSLSWO_train)[5]="outcome"
RSLSWO_train$outcome=as.factor(RSLSWO_train$outcome)
levels(RSLSWO_train$outcome)=c("failure","success")
kk=0
for(k in k_test){
kk=kk+1
pred=knn3Train(train=RSLSWO_train[,1:4],test=data.train2[-fold,c(1,2,13,14)],
cl=RSLSWO_train[,5],k=k,prob=TRUE)
prob=attr(pred,"prob")
ss=0
for(s in threshold){
ss=ss+1
newpred = as.factor(ifelse(prob[,"failure"] > s, "failure", "success"))
levels(newpred) <- c("failure","success")
cm=confusionMatrix(newpred,data.train2[-fold,]$outcome,positive="failure")
meas.matrix[1,ss,kk]=meas.matrix[1,ss,kk]+cm$overall[1]; # Accuracy and Specificity are never NA or NaN
meas.matrix[4,ss,kk]=meas.matrix[4,ss,kk]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,ss,kk]=meas.matrix[2,ss,kk]+1;
meas.matrix[3,ss,kk]=meas.matrix[3,ss,kk]+1;
meas.matrix[5,ss,kk]=meas.matrix[5,ss,kk]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,ss,kk]=meas.matrix[2,ss,kk];
meas.matrix[3,ss,kk]=meas.matrix[3,ss,kk];
meas.matrix[5,ss,kk]=meas.matrix[5,ss,kk];
}
else{
meas.matrix[2,ss,kk]=meas.matrix[2,ss,kk]+cm$byClass[5];
meas.matrix[3,ss,kk]=meas.matrix[3,ss,kk]+cm$byClass[1];
meas.matrix[5,ss,kk]=meas.matrix[5,ss,kk]+cm$byClass[7];
}
}
}
}
meas.matrix/(nfolds*repeats)
## choose best k and threshold
k.RSLS=NA
s.RSLS=NA
f1max = 0
for(s in 1:length(threshold)){
for(k in 1:length(k_test)){
f1 <- meas.matrix["F1",s,k]
if(f1max <f1){
f1max = f1
k.RSLS=k
s.RSLS=s
}
}
}
k.RSLS
s.RSLS
# train model: hyperparameters fixed from the cross-validation above
k.RSLS=2 # k_test[2] = 3
s.RSLS=5 # threshold[5] = 0.5
cat('Best k:',k_test[k.RSLS],', best threshold:',threshold[s.RSLS],'for RSLS without outcasts.')
Best k: 3 , best threshold: 0.5 for RSLS without outcasts.
Now, we intersect the results from \(3\)-NN trained on the RSLS training set without outcasts with those from \(1\)-NN trained on a data set composed only of outcasts and the original negative instances.
The results are shown in the figure below.
library(caret)
data.train1=data.train[,c(1,2,13,14,19)]
data.test1=data.test[,c(1,2,13,14,19)]
knn.original=knn3Train(train=data.train1[,1:4],test=data.test1[,1:4],cl=data.train1[,5],
k=k_test[k.original],prob=TRUE)
prob.knn.original=attr(knn.original,"prob")
newpred=as.factor(ifelse(prob.knn.original[,"failure"] > threshold[s.original], "failure", "success"))
levels(newpred) <- c("failure","success")
cm=confusionMatrix(newpred,data.test1$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[2,1]=cm$byClass[[7]]
final.meas[2,2]=cm$byClass[[1]]
final.meas[2,3]=cm$byClass[[5]]
knn.SMOTE=knn3Train(train=SMOTEtrain[,c(1,2,13,14)], test=data.test1[,1:4],
cl=SMOTEtrain[,19],k=k_test[k.SMOTE],prob=TRUE)
prob.knn.SMOTE=attr(knn.SMOTE,"prob")
newpred = as.factor(ifelse(prob.knn.SMOTE[,"failure"] > threshold[s.SMOTE], "failure", "success"))
levels(newpred) <- c("failure","success")
cm=confusionMatrix(newpred,data.test1$outcome,positive="failure")
draw_confusion_matrix(cm)
knn.RSLS=knn3Train(train=RSLSWOtrain[,c(1,2,13,14)],test=data.test1[,1:4],
cl=RSLSWOtrain[,19],k=k_test[k.RSLS],prob=TRUE)
prob.knn.RSLS=attr(knn.RSLS,"prob")
newpred=as.factor(ifelse(prob.knn.RSLS[,"failure"] > threshold[s.RSLS], "failure", "success"))
levels(newpred) <- c("failure","success")
cm=confusionMatrix(newpred,data.test1$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[3,1]=cm$byClass[[7]]
final.meas[3,2]=cm$byClass[[1]]
final.meas[3,3]=cm$byClass[[5]]
pred.knn1=knn3Train(train=negOutTrain[,c(1,2,13,14)],test=data.test[,c(1,2,13,14)],cl=negOutTrain$outcome,k=1,prob=TRUE)
prob.acc=attr(pred.knn1,"prob")
prob.knn.RSLSho=prob.knn.RSLS
prob.knn.RSLSho[pred.knn1=="failure",]=prob.acc[pred.knn1=="failure",]
newpred[pred.knn1=="failure"]="failure"
cm=confusionMatrix(newpred,data.test1$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[4,1]=cm$byClass[[7]]
final.meas[4,2]=cm$byClass[[1]]
final.meas[4,3]=cm$byClass[[5]]
Figure 2.10: Confusion matrices for k-NN trained on the original data set (top left), the SMOTE training set (top right), the RSLS training set without outcasts (bottom left) and the RSLS training set handling outcasts (bottom right).
Within binary classification, SVMs represent a class of algorithms that perform classification by means of an appropriate separating surface in the feature space. The idea they are based on is rather intuitive: given a \(d\)-dimensional feature space, they aim to identify a \((d-1)\)-dimensional surface that divides it into two separate regions, each, ideally, associated with one of the two classes.
The simplest representative of this class of models is called Hard-SVM, or maximal margin classifier. It classifies examples by finding the separating hyperplane in the feature space that maximizes the margin, i.e. the minimum distance between the records and the hyperplane itself, and then assigning each example to the class corresponding to the halfspace in which it lies. Briefly, denoting a generic hyperplane by \((\mathbb{w},b)\) (where \(\mathbb{w}\) is the normal vector of the hyperplane), the Hard-SVM rule corresponds to the following quadratic optimization problem:
\[ \arg\min_{\substack{(\mathbb{w},b)\in\mathbb{R}^{d+1} \\ \text{s.t.}\ y_i(\langle\mathbb{w},x_i\rangle+b)\geq 1,\ i=1,\dots,n}} \frac{1}{2} \| \mathbb{w} \|^2 \] \((x_i,y_i)\) is correctly classified when \(y_i\left(\langle\mathbb{w},x_i\rangle+b\right)\geq 1\).
However, in many cases a separating hyperplane does not exist, because the feature space is not linearly separable, so Hard-SVM cannot be applied. This problem can be tackled in two ways:
we can resort to Soft-SVM;
if the previous alternative still returns poor results, we can exploit kernel SVM methods.
Soft-SVM is very similar to the hard version, except for the fact that we allow some misclassifications. The hyperplane used in Soft-SVM corresponds to the solution of the following optimization problem:
\[ \arg\min_{\substack{(\mathbb{w},b)\in\mathbb{R}^{d+1},\ \epsilon\in\mathbb{R}^n \\ \text{s.t.}\ y_i\left(\langle\mathbb{w},x_i\rangle+b\right)\geq 1-\epsilon_i,\ i=1,\dots,n \\ \epsilon_i\geq 0,\ i=1,\dots,n}} \left(\frac{1}{2} \| \mathbb{w} \|^2+C\sum_{i=1}^n \epsilon_i\right) \]
Here, \(\epsilon_1,\ldots,\epsilon_n\) are slack variables that allow individual margin violations:
if \(\epsilon_i=0\), then the \(i\)-th observation is placed in the correct halfspace and does not cross the margin boundary;
if \(0<\epsilon_i<1\), then the \(i\)-th observation oversteps the margin;
if \(\epsilon_i>1\), then the observation is on the wrong side of the hyperplane.
The second term of the objective function penalizes margin violations. In particular, the regularization parameter \(C\) can be read as the assigned misclassification cost. Notice that the slack variables \(\epsilon_1,\ldots,\epsilon_n\) and the parameter \(C\) are strictly related: when \(C\) is high, the model allows fewer margin violations, while as \(C\) decreases, the number of violations rises. Essentially, \(C\) controls the bias-variance trade-off of Soft-SVM: for large values of \(C\), margins are tighter and the classifier fits the data closely, giving low bias and high variance, while for small values the model fits the data more loosely, giving a potentially more biased but lower-variance classifier. In other words, \(C\) balances the goal of maximizing the margin against that of making as few misclassifications as possible.
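The effect of \(C\) can be seen directly with `e1071::svm`. The sketch below uses a synthetic two-cluster data set of our own (not the report's data): with a small cost the margin is wide and many points become support vectors, while a large cost leaves only a few.

```r
library(e1071)

# Toy illustration only: two well-separated Gaussian clusters.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
           matrix(rnorm(40, mean =  2), ncol = 2))
y <- factor(rep(c("failure", "success"), each = 20))
toy <- data.frame(x1 = x[, 1], x2 = x[, 2], outcome = y)

# Same linear Soft-SVM, two very different misclassification costs.
svm_soft  <- svm(outcome ~ ., data = toy, kernel = "linear", cost = 0.01, scale = FALSE)
svm_tight <- svm(outcome ~ ., data = toy, kernel = "linear", cost = 100,  scale = FALSE)

# A small C tolerates violations, so many points end up on or inside the margin.
c(soft = svm_soft$tot.nSV, tight = svm_tight$tot.nSV)
```

On this well-separated toy set the low-cost fit keeps far more support vectors than the high-cost fit, matching the bias-variance reading above.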
The solution to this problem can be found by first solving its dual Lagrangian problem:
\[ \arg\max_{\substack{\alpha\in\mathbb{R}^n \\ \text{s.t.}\ \sum_{i=1}^{n} y_{i} \alpha_{i}=0 \\ 0 \leq \alpha_{i} \leq C,\ i=1,\dots,n}} \left(\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \left\langle x_{i}, x_{j}\right\rangle \right), \]
where \(\alpha_i\) are the Lagrange multipliers, and then retrieving \(\mathbb{w}\) by applying the Fritz John theorem and using the following formula:
\[ \mathbb{w}=\sum_{i=1}^{l} \alpha_{i} y_{i} x_{i}, \]
where \(l\) is the number of support vectors (the training points with nonzero multipliers, which alone determine the separating hyperplane).
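For a linear kernel, this reconstruction can be checked against `e1071::svm` directly: the fitted object stores \(\alpha_i y_i\) in `$coefs` and the support vectors in `$SV`, so \(\mathbb{w}\) is their product and \(b=-\rho\). The toy data below is our own illustration, not the report's data set.

```r
library(e1071)

# Illustrative data only: two linearly separable clusters.
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
           matrix(rnorm(40, mean =  2), ncol = 2))
y <- factor(rep(c("failure", "success"), each = 20))

fit <- svm(x, y, kernel = "linear", cost = 1, scale = FALSE)

# w = sum_i alpha_i y_i x_i over the support vectors; b = -rho in e1071.
w <- t(fit$coefs) %*% fit$SV
b <- -fit$rho

# The manual decision function must match the one used by predict().
manual <- as.vector(x %*% t(w) + b)
dv <- as.vector(attr(predict(fit, x, decision.values = TRUE), "decision.values"))
max(abs(manual - dv))  # numerically zero
```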
As is often the case, our data set is not linearly separable (as shown in the first column of figure 2.3). For this reason, we decided not to fit Hard-SVM to our data and to use Soft-SVM right away.
For dealing with imbalanced data, we also decided to implement a cost-sensitive learning solution known as the different error cost (DEC) method, as suggested in [4]. This method is analogous to standard Soft-SVM, except that the objective function of the optimization problem becomes:
\[ \frac{1}{2} \|\mathbb{w}\|^2+C^{+} \sum_{i : y_{i}=+1}^{n} \epsilon_{i}+C^{-} \sum_{i : y_{i}=-1}^{n} \epsilon_{i} \]
Basically, we are assigning a different misclassification cost to each class. From now on, we will refer to \(C^+\) and \(C^-\) as weights. By introducing these weights, we aim at increasing the importance of the minority class: a stronger penalization of its errors should force the training procedure to focus on the instances belonging to it.
We applied SMOTE and RSLS handling outcasts to our data set. This choice is motivated by the several successful applications of these resampling methods reported in the literature ([2],[3]).
We are going to find the optimal values for weights \((C^+,C^-)\) via cross-validation. Unlike what we did for \(k\)-NN, we decided not to select an optimal classification threshold via cross-validation for SVM. The reason is that, by fixing class weights, we already set a value for the probability threshold ([4]):
\[ s=\frac{C^+}{C^++C^-} \]
Hence, finding the optimal class weights implies determining the optimal value for \(s\).
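The implied threshold is immediate to compute; a one-line helper (our own naming, for illustration) reproduces, for instance, the value \(2/3\) printed later for the weight pair \((2,1)\):

```r
# Probability threshold implied by the DEC class weights: s = C+ / (C+ + C-).
implied_threshold <- function(w) w[1] / (w[1] + w[2])

implied_threshold(c(2, 1))  # 2/3, as reported for the linear kernel on the original data
implied_threshold(c(1, 1))  # 0.5, the symmetric default
```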
We now perform cross-validation as we did for \(k\)-NN in order to assess the best weights for Soft-SVM with linear kernel for all three training data sets.
library(splitTools)
library(DMwR)
library(e1071)
library(caret)
repeats=10
nfolds=20
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
meas.matrix<-array(0,dim=c(5,length(weights)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),weights))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
ww=0
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(w in weights){
ww=ww+1
count=1
for(fold in folds){
model.svm.linear=svm(outcome~.,data=data.train[fold,],method="C-classification",
kernel="linear",probability=T,scale=F,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.linear,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
mm[1,count]=cm$overall[1]; # Accuracy and Specificity are never NA or NaN
mm[4,count]=cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
mm[2,count]=1;
mm[3,count]=1;
mm[5,count]=1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
mm[2,count]=0;
mm[3,count]=0;
mm[5,count]=0;
}
else{
mm[2,count]=cm$byClass[5];
mm[3,count]=cm$byClass[1];
mm[5,count]=cm$byClass[7];
}
count=count+1
}
meas.matrix[,ww]=apply(mm,MARGIN=1,mean)
}
meas.matrix
# Choosing the best hyperparameters
w.linear=NA
f1max=0
for(w in 1:length(weights)){
f1=meas.matrix["F1",w]
if(f1>f1max){
f1max=f1
w.linear=w
}
}
library(splitTools)
library(DMwR)
library(e1071)
library(caret)
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
w.linear=2
cat('Best weight for positive class:',weights[[w.linear]][1],', best threshold:', weights[[w.linear]][1]/(weights[[w.linear]][1]+weights[[w.linear]][2]),'for original.')
Best weight for positive class: 2 , best threshold: 0.6666667 for original.
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
meas.matrix<-array(0,dim=c(5,length(weights)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),weights))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
ww=0
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(w in weights){
ww=ww+1
count=1
for(fold in folds){
# SMOTE sampling
SMOTE_train<-smotefamily::SMOTE(data.train[fold,1:18],data.train[fold,19],K=5,dup_size=0)
SMOTE_train<-SMOTE_train$data
colnames(SMOTE_train)[19]="outcome"
SMOTE_train$outcome=as.factor(SMOTE_train$outcome)
model.svm.linear=svm(outcome~.,data=SMOTE_train,method="C-classification",
kernel="linear",probability=T,scale=F,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.linear,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
mm[1,count]=cm$overall[1]; # Accuracy and Specificity are never NA or NaN
mm[4,count]=cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
mm[2,count]=1;
mm[3,count]=1;
mm[5,count]=1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
mm[2,count]=0;
mm[3,count]=0;
mm[5,count]=0;
}
else{
mm[2,count]=cm$byClass[5];
mm[3,count]=cm$byClass[1];
mm[5,count]=cm$byClass[7];
}
count=count+1
}
meas.matrix[,ww]=apply(mm,MARGIN=1,mean)
}
meas.matrix
# Choosing the best hyperparameters
w.linear.SMOTE=NA
f1max=0
for(w in 1:length(weights)){
f1=meas.matrix["F1",w]
if(f1>f1max){
f1max=f1
w.linear.SMOTE=w
}
}
w.linear.SMOTE
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
w.linear.SMOTE=1
cat('Best weight for positive class:',weights[[w.linear.SMOTE]][1],', best threshold:',weights[[w.linear.SMOTE]][1]/(weights[[w.linear.SMOTE]][1]+weights[[w.linear.SMOTE]][2]),'for SMOTE.')
Best weight for positive class: 1 , best threshold: 0.5 for SMOTE.
data.train2=anti_join(data.train,outcast)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
mm<-array(0,dim=c(5,length(weights)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),weights))
set.seed(3)
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
SMOTE_model<-smotefamily::RSLS(data.train2[fold,1:18],data.train2[fold,19],K=5,C=5,dupSize=0)
RSLSWO_train=SMOTE_model$data
colnames(RSLSWO_train)[19]="outcome"
RSLSWO_train$outcome=as.factor(RSLSWO_train$outcome)
levels(RSLSWO_train$outcome)=c("failure","success")
count=0
for(w in weights){
count=count+1
model.svm.linear.RSLS=svm(outcome~.,data=RSLSWO_train,method="C-classification",
kernel="linear",probability=T,scale=F,
class.weights=w*prop/table(data.train2$outcome))
pred=predict(model.svm.linear.RSLS,data.train2[-fold,],probability=F)
cm=confusionMatrix(pred,data.train2[-fold,]$outcome,positive="failure")
mm[1,count]=mm[1,count]+cm$overall[1]; # Accuracy and Specificity are never NA or NaN
mm[4,count]=mm[4,count]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
mm[2,count]=mm[2,count]+1;
mm[3,count]=mm[3,count]+1;
mm[5,count]=mm[5,count]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
mm[2,count]=mm[2,count];
mm[3,count]=mm[3,count];
mm[5,count]=mm[5,count];
}
else{
mm[2,count]=mm[2,count]+cm$byClass[5];
mm[3,count]=mm[3,count]+cm$byClass[1];
mm[5,count]=mm[5,count]+cm$byClass[7];
}
}
}
mm=mm/(nfolds*repeats) # parentheses needed: average over all folds
# Choosing the best hyperparameters
w.linear.RSLS=NA
f1max=0
for(w in 1:length(weights)){
f1=mm["F1",w]
if(f1>f1max){
f1max=f1
w.linear.RSLS=w
}
}
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
w.linear.RSLS=1
cat('Best weight for positive class:',weights[[w.linear.RSLS]][1],', best threshold:',weights[[w.linear.RSLS]][1]/(weights[[w.linear.RSLS]][1]+weights[[w.linear.RSLS]][2]),'for RSLS without outcasts.')
Best weight for positive class: 1 , best threshold: 0.5 for RSLS without outcasts.
#Train model
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
w.linear=2
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
model.svm.linear=svm(outcome~.,data=data.train,method="C-classification",
kernel="linear",probability=T,scale=F,
class.weights=weights[[w.linear]]*prop/table(data.train$outcome))
pred.svm.linear.original=predict(model.svm.linear,data.test,probability=F)
cm=confusionMatrix(pred.svm.linear.original,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[5,1]=cm$byClass[[7]]
final.meas[5,2]=cm$byClass[[1]]
final.meas[5,3]=cm$byClass[[5]]
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
w.linear.SMOTE=1
model.svm.linear.SMOTE=svm(outcome~.,data=SMOTEtrain,method="C-classification",
kernel="linear",probability=T,scale=F,
class.weights=weights[[w.linear.SMOTE]]*prop/table(data.train$outcome))
pred.svm.linear.SMOTE=predict(model.svm.linear.SMOTE,data.test,probability=F)
cm=confusionMatrix(pred.svm.linear.SMOTE,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[6,1]=cm$byClass[[7]]
final.meas[6,2]=cm$byClass[[1]]
final.meas[6,3]=cm$byClass[[5]]
data.train2=anti_join(data.train,outcast)
prop=c(sum(data.train2$outcome=="failure"),sum(data.train2$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1),c(10,1))
w.linear.RSLS=1
model.svm.linear.RSLS=svm(outcome~.,data=RSLSWOtrain,method="C-classification",
kernel="linear",probability=T,scale=F,
class.weights=weights[[w.linear.RSLS]]*prop/table(data.train$outcome))
pred.svm.linear.RSLS=predict(model.svm.linear.RSLS,data.test,probability=F)
cm=confusionMatrix(pred.svm.linear.RSLS,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
pred.knn1=knn3Train(train=negOutTrain[,c(1,2,13,14)],test=data.test[,c(1,2,13,14)],cl=negOutTrain$outcome,k=1)
prob.knn.RSLSho[pred.knn1=="failure",]=prob.acc[pred.knn1=="failure",]
pred.svm.linear.RSLS[pred.knn1=="failure"]="failure"
cm=confusionMatrix(pred.svm.linear.RSLS,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[7,1]=cm$byClass[[7]]
final.meas[7,2]=cm$byClass[[1]]
final.meas[7,3]=cm$byClass[[5]]
Figure 2.11: Confusion matrices for SVM with linear kernel trained on the original training set (top left), on the SMOTE training set (top right), on the RSLS training set without outcasts (bottom left) and on the RSLS training set handling outcasts (bottom right).
As we already mentioned, a second option for dealing with non-linearly separable observations is given by kernel SVM methods. The idea they are based on is to map the data into a higher dimensional feature space in which our set of points becomes linearly separable; it is in this new space that these models decide which label to assign to each observation. Thanks to the so-called kernel trick, these methods do not need to know the form of the mapping or the images of our points in the newly defined space, as long as the values of the kernel function over pairs of domain points are known. Kernel functions correspond to inner products in the higher dimensional space where we want to solve our problem. Like the Euclidean inner product, they measure similarity between points.
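As a concrete check of the kernel trick, for a degree-2 homogeneous polynomial kernel in \(\mathbb{R}^2\) the explicit feature map is known, so the kernel value and the inner product of the mapped points can be compared directly (the map `phi2` below is the textbook one, written out for illustration):

```r
# Degree-2 homogeneous polynomial kernel: K(x, y) = <x, y>^2.
k_poly2 <- function(x, y) sum(x * y)^2

# Its explicit feature map for 2-d inputs: phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
phi2 <- function(x) c(x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])

x <- c(1, 2); y <- c(3, -1)
k_poly2(x, y)            # kernel value computed in the original 2-d space
sum(phi2(x) * phi2(y))   # the same value, computed as an inner product in the mapped space
```

The kernel evaluates the inner product in the 3-dimensional mapped space without ever constructing `phi2(x)`, which is exactly what makes infinite-dimensional maps (such as the Gaussian kernel's) tractable.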
Combining this idea with Soft-SVM amounts to solving the Soft-SVM problem in the higher dimensional feature space. The formulation is almost identical to that of Soft-SVM:
\[ \arg\min_{\substack{(\mathbb{w},b)\in\mathbb{R}^{d+1},\ \epsilon\in\mathbb{R}^n \\ \text{s.t.}\ y_i\left(\langle\mathbb{w},\phi(x_i)\rangle+b\right)\geq 1-\epsilon_i,\ i=1,\dots,n \\ \epsilon_i\geq 0,\ i=1,\dots,n}} \left(\frac{1}{2}\|\mathbb{w}\|^2+C\sum_{i=1}^n \epsilon_i\right) \]
where \(\phi(x)\) is the mapping between the original data space and the newly defined one. As in the linear case, we first need to solve the dual Lagrangian problem:
\[ \arg\max_{\substack{\alpha\in\mathbb{R}^n \\ \text{s.t.}\ \sum_{i=1}^{n} y_{i} \alpha_{i}=0 \\ 0 \leq \alpha_{i} \leq C,\ i=1,\dots,n}} \left(\sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} K( x_{i}, x_{j}) \right), \]
where \(K(\cdot,\cdot)\) is the kernel function, and then we retrieve \(\mathbb{w}\) as before.
In the following, we test Gaussian, polynomial and sigmoid kernels. The Gaussian kernel has the following expression:
\[ K_G\left(x,x^\prime\right)=\exp\left(-\gamma\|x-x^\prime\|^2\right) \]
\(\gamma\) is a positive hyperparameter that controls the scale determining the meaning of “close”. This kernel corresponds to an inner product in an infinite dimensional space. Its value is close to \(0\) when \(x\) and \(x^\prime\) are far from each other, while it is close to \(1\) when they are near.
The polynomial kernel, instead, is given by the following expression: \[ K_{P_p}(x,x^\prime)=(\langle x,x^\prime\rangle+c)^p \]
Here, \(p\) denotes the degree of the polynomial mapping. Ideally, \(p\) should be equal to the degree of the decision boundary of the classification problem in the original feature space. This kernel is also characterized by the intercept \(c\).
Finally, the sigmoid kernel takes the following form: \[ K_{S}(x,x^\prime)=\tanh(\gamma\langle x,x^\prime\rangle+c) \] This kernel choice leads to an algorithm closely related to simple neural networks. It is characterized by the slope \(\gamma>0\) and the intercept \(c>0\).
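The three kernels can be written out as plain R functions following the formulas above (a sketch matching the text; note that `e1071` additionally scales the inner product by \(\gamma\) inside its polynomial kernel, so these are not drop-in replicas of that implementation):

```r
# Gaussian (RBF): K(x, x') = exp(-gamma * ||x - x'||^2)
k_gaussian <- function(x, y, gamma) exp(-gamma * sum((x - y)^2))

# Polynomial of degree p with intercept c: K(x, x') = (<x, x'> + c)^p
k_polynomial <- function(x, y, c, p) (sum(x * y) + c)^p

# Sigmoid with slope gamma and intercept c: K(x, x') = tanh(gamma * <x, x'> + c)
k_sigmoid <- function(x, y, gamma, c) tanh(gamma * sum(x * y) + c)

x <- c(1, 0); y <- c(0, 1)
k_gaussian(x, x, gamma = 1)        # 1: a point is maximally similar to itself
k_polynomial(x, y, c = 1, p = 3)   # 1, since <x, y> = 0 here
```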
In this case too, we implement the different error cost method, assigning a different misclassification weight to each class.
The best kernel and the associated optimal hyperparameters values are assessed again through cross-validation, analogously to what we did for the previous models.
Again, models are trained on the original data set and on the data sets obtained by applying SMOTE and RSLS (without outcasts and handling outcasts).
gamma=2^seq(-4,0,0.5)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
meas.matrix<-array(0,dim=c(5,length(gamma),length(weights)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),gamma,weights))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
gg=0
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
gg=0
for(g in gamma){
gg=gg+1
ww=0
for(w in weights){
ww=ww+1
model.svm.radial=svm(outcome~.,data=data.train[fold,],method="C-classification",
kernel="radial",probability=T,scale=F,gamma=g,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.radial,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,gg,ww]=meas.matrix[1,gg,ww]+cm$overall[1];
meas.matrix[4,gg,ww]=meas.matrix[4,gg,ww]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww]+1;
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww]+1;
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww];
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww];
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww];
}
else{
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww]+cm$byClass[5];
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww]+cm$byClass[1];
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww]+cm$byClass[7];
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
g.radial=NA
w.radial=NA
f1max.rad=0
for(g in 1:length(gamma)){
for(w in 1:length(weights)){
f1=meas.matrix["F1",g,w]
if(f1>f1max.rad){
f1max.rad=f1
g.radial=g
w.radial=w
}
}
}
f1max.rad
g.radial
w.radial
degree=seq(2,8,1)
c0=c(-10,-5,0,5,10)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
meas.matrix<-array(0,dim=c(5,length(degree),length(weights),length(c0)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),degree,weights,c0))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
gg=0
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
dd=0
for(d in degree){
dd=dd+1
jj=0
for(j in c0){
jj=jj+1
ww=0
for(w in weights){
ww=ww+1
model.svm.radial=svm(outcome~.,data=data.train[fold,],method="C-classification",
kernel="polynomial",degree=d,coef0=j,probability=T,scale=F,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.radial,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,dd,ww,jj]=meas.matrix[1,dd,ww,jj]+cm$overall[1];
meas.matrix[4,dd,ww,jj]=meas.matrix[4,dd,ww,jj]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,dd,ww,jj]=meas.matrix[2,dd,ww,jj]+1;
meas.matrix[3,dd,ww,jj]=meas.matrix[3,dd,ww,jj]+1;
meas.matrix[5,dd,ww,jj]=meas.matrix[5,dd,ww,jj]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,dd,ww,jj]=meas.matrix[2,dd,ww,jj];
meas.matrix[3,dd,ww,jj]=meas.matrix[3,dd,ww,jj];
meas.matrix[5,dd,ww,jj]=meas.matrix[5,dd,ww,jj];
}
else{
meas.matrix[2,dd,ww,jj]=meas.matrix[2,dd,ww,jj]+cm$byClass[5];
meas.matrix[3,dd,ww,jj]=meas.matrix[3,dd,ww,jj]+cm$byClass[1];
meas.matrix[5,dd,ww,jj]=meas.matrix[5,dd,ww,jj]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
d.pol=NA
w.pol=NA
c0.pol=NA
f1max.pol=0
for(d in 1:length(degree)){
for(w in 1:length(weights)){
for(j in 1:length(c0)){
f1=meas.matrix["F1",d,w,j]
if(f1>f1max.pol){
f1max.pol=f1
d.pol=d
w.pol=w
c0.pol=j
}
}
}
}
f1max.pol
d.pol
w.pol
c0.pol
gamma=seq(.5,4,.5)
c0=c(0,2,5,10)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
meas.matrix<-array(0,dim=c(5,length(gamma),length(weights),length(c0)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),gamma,weights,c0))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
dd=0
for(d in gamma){
dd=dd+1
jj=0
for(j in c0){
jj=jj+1
ww=0
for(w in weights){
ww=ww+1
model.svm.radial=svm(outcome~.,data=data.train[fold,],method="C-classification",
kernel="sigmoid",gamma=d,coef0=j,probability=T,scale=F,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.radial,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,dd,ww,jj]=meas.matrix[1,dd,ww,jj]+cm$overall[1];
meas.matrix[4,dd,ww,jj]=meas.matrix[4,dd,ww,jj]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,dd,ww,jj]=meas.matrix[2,dd,ww,jj]+1;
meas.matrix[3,dd,ww,jj]=meas.matrix[3,dd,ww,jj]+1;
meas.matrix[5,dd,ww,jj]=meas.matrix[5,dd,ww,jj]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,dd,ww,jj]=meas.matrix[2,dd,ww,jj];
meas.matrix[3,dd,ww,jj]=meas.matrix[3,dd,ww,jj];
meas.matrix[5,dd,ww,jj]=meas.matrix[5,dd,ww,jj];
}
else{
meas.matrix[2,dd,ww,jj]=meas.matrix[2,dd,ww,jj]+cm$byClass[5];
meas.matrix[3,dd,ww,jj]=meas.matrix[3,dd,ww,jj]+cm$byClass[1];
meas.matrix[5,dd,ww,jj]=meas.matrix[5,dd,ww,jj]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
d.sig=NA
w.sig=NA
c0.sig=NA
f1max.sig=0
for(d in 1:length(gamma)){
for(w in 1:length(weights)){
for(j in 1:length(c0)){
f1=meas.matrix["F1",d,w,j]
if(f1>f1max.sig){
f1max.sig=f1
d.sig=d
w.sig=w
c0.sig=j
}
}
}
}
f1max.sig
d.sig
w.sig
c0.sig
#Gauss
g.radial=2
w.radial=2
f1max.rad=0.6380476
#Poly
degree=seq(2,8,1)
c0=c(-10,-5,0,5,10)
weights=list(c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
f1max.pol=0.6245238
d.pol=1
w.pol=1
c0.pol=2
#Sigmoid
gamma=seq(.5,4,.5)
c0=c(0,2,5,10)
weights=list(c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
f1max.sig=0.1720088
d.sig=1
w.sig=6
c0.sig=1
gamma=2^seq(-4,0,0.5)
weights=list(c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
cat('Best kernel: Gaussian, best gamma:',gamma[g.radial],', best weight for positive class:',weights[[w.radial]][1],', best threshold:',weights[[w.radial]][1]/(weights[[w.radial]][1]+weights[[w.radial]][2]),'for original.')
Best kernel: Gaussian, best gamma: 0.08838835 , best weight for positive class: 4 , best threshold: 0.8 for original.
gamma=2^seq(-4,0,.5)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
meas.matrix<-array(0,dim=c(5,length(gamma),length(weights)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),gamma,weights))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
gg=0
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
SMOTE_train<-smotefamily::SMOTE(data.train[fold,1:18],data.train[fold,19],K=5,dup_size=0)
SMOTE_train<-SMOTE_train$data
colnames(SMOTE_train)[19]="outcome"
SMOTE_train$outcome=as.factor(SMOTE_train$outcome)
gg=0
for(g in gamma){
gg=gg+1
ww=0
for(w in weights){
ww=ww+1
model.svm.radial=svm(outcome~.,data=SMOTE_train,method="C-classification",
kernel="radial",probability=T,scale=F,gamma=g,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.radial,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,gg,ww]=meas.matrix[1,gg,ww]+cm$overall[1];
meas.matrix[4,gg,ww]=meas.matrix[4,gg,ww]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww]+1;
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww]+1;
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww];
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww];
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww];
}
else{
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww]+cm$byClass[5];
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww]+cm$byClass[1];
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww]+cm$byClass[7];
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
# choosing the best hyperparameters
g.radial.SMOTE=NA
w.radial.SMOTE=NA
f1max.rad.SMOTE=0
for(g in 1:length(gamma)){
for(w in 1:length(weights)){
f1=meas.matrix[5,g,w]
if(f1>f1max.rad.SMOTE){
f1max.rad.SMOTE=f1
g.radial.SMOTE=g
w.radial.SMOTE=w
}
}
}
f1max.rad.SMOTE
g.radial.SMOTE
w.radial.SMOTE
degree = seq(2,8,1)
c0= c(-10,-5,0,5,10)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
meas.matrix<-array(0,dim=c(5,length(degree),length(weights),length(c0)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),degree,weights,c0))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
gg=0
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
SMOTE_train<-smotefamily::SMOTE(data.train[fold,1:18],data.train[fold,19],K=5,dup_size=0)
SMOTE_train<-SMOTE_train$data
colnames(SMOTE_train)[19]="outcome"
SMOTE_train$outcome=as.factor(SMOTE_train$outcome)
gg=0
for(g in degree){
gg=gg+1
jj=0
for(j in c0){
jj=jj+1
ww=0
for(w in weights){
ww=ww+1
model.svm.poly=svm(outcome~.,data=SMOTE_train,type="C-classification",
kernel="polynomial",probability=T,scale=F,degree=g,coef0=j,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.poly,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,gg,ww,jj]=meas.matrix[1,gg,ww,jj]+cm$overall[1];
meas.matrix[4,gg,ww,jj]=meas.matrix[4,gg,ww,jj]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+1;
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+1;
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj];
}
else{
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+cm$byClass[5];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+cm$byClass[1];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
g.polynomial.SMOTE=NA
w.polynomial.SMOTE=NA
c0.polynomial.SMOTE=NA
f1max.polynomial.SMOTE=0
for(g in 1:length(degree)){
for(j in 1:length(c0)){
for(w in 1:length(weights)){
f1=meas.matrix[5,g,w,j]
if(f1>f1max.polynomial.SMOTE){
f1max.polynomial.SMOTE=f1
g.polynomial.SMOTE=g
w.polynomial.SMOTE=w
c0.polynomial.SMOTE=j
}
}
}
}
f1max.polynomial.SMOTE
g.polynomial.SMOTE
w.polynomial.SMOTE
c0.polynomial.SMOTE
gamma = seq(0.5,4,0.5)
c0= c(2,5,10)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
meas.matrix<-array(0,dim=c(5,length(gamma),length(weights),length(c0)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),gamma,weights,c0))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
gg=0
folds<-create_folds(data.train$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
SMOTE_train<-smotefamily::SMOTE(data.train[fold,1:18],data.train[fold,19],K=5,dup_size=0)
SMOTE_train<-SMOTE_train$data
colnames(SMOTE_train)[19]="outcome"
SMOTE_train$outcome=as.factor(SMOTE_train$outcome)
gg=0
for(g in gamma){
gg=gg+1
jj=0
for(j in c0){
jj=jj+1
ww=0
for(w in weights){
ww=ww+1
model.svm.sigmoid=svm(outcome~.,data=SMOTE_train,type="C-classification",
kernel="sigmoid",probability=T,scale=F,gamma=g,coef0=j,
class.weights=w*prop/table(data.train$outcome))
pred=predict(model.svm.sigmoid,data.train[-fold,],probability=F)
cm=confusionMatrix(pred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,gg,ww,jj]=meas.matrix[1,gg,ww,jj]+cm$overall[1];
meas.matrix[4,gg,ww,jj]=meas.matrix[4,gg,ww,jj]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+1;
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+1;
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj];
}
else{
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+cm$byClass[5];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+cm$byClass[1];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
g.sigmoid.SMOTE=NA
w.sigmoid.SMOTE=NA
c0.sigmoid.SMOTE=NA
f1max.sigmoid.SMOTE=0
for(g in 1:length(gamma)){
for(j in 1:length(c0)){
for(w in 1:length(weights)){
f1=meas.matrix[5,g,w,j]
if(f1>f1max.sigmoid.SMOTE){
f1max.sigmoid.SMOTE=f1
g.sigmoid.SMOTE=g
w.sigmoid.SMOTE=w
c0.sigmoid.SMOTE=j
}
}
}
}
g.sigmoid.SMOTE
w.sigmoid.SMOTE
c0.sigmoid.SMOTE
f1max.sigmoid.SMOTE
#Gauss SMOTE
g.radial.SMOTE=3
w.radial.SMOTE=1
f1max.rad.SMOTE=0.6285
#poly SMOTE
d.pol.SMOTE=1 #degree=seq(2,8,1)
w.pol.SMOTE=1 #weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
c0.pol.SMOTE=4 #c0=seq(-10,10,5)
f1max.pol.SMOTE=0.6045714
#sigmoid SMOTE
gamma=seq(0.5,4,0.5)
c0=c(2,5,10)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
g.sigmoid.SMOTE=1
w.sigmoid.SMOTE=2
c0.sigmoid.SMOTE=1
f1max.sigmoid.SMOTE=0.1607936
gamma=2^seq(-4,0,.5)
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
cat('Best kernel: Gaussian, best gamma:',gamma[g.radial.SMOTE],', best weight for positive class:',weights[[w.radial.SMOTE]][1],', best threshold:',weights[[w.radial.SMOTE]][1]/(weights[[w.radial.SMOTE]][1]+weights[[w.radial.SMOTE]][2]),'for SMOTE.')
## Best kernel: Gaussian, best gamma: 0.125 , best weight for positive class: 1 , best threshold: 0.5 for SMOTE.
data.train2=anti_join(data.train,outcast)
gamma=2^seq(-2.5,1,.5)
prop=c(sum(data.train2$outcome=="failure"),sum(data.train2$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
meas.matrix<-array(0,dim=c(5,length(gamma),length(weights)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),gamma,weights))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
gg=0
folds<-create_folds(data.train2$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
SMOTE_model<-smotefamily::RSLS(data.train2[fold,1:18],data.train2[fold,19],K=5,C=5,dupSize=0)
RSLSWO_train=SMOTE_model$data
colnames(RSLSWO_train)[19]="outcome"
RSLSWO_train$outcome=as.factor(RSLSWO_train$outcome)
levels(RSLSWO_train$outcome)=c("failure","success")
gg=0
for(g in gamma){
gg=gg+1
ww=0
for(w in weights){
ww=ww+1
model.svm.radial=svm(outcome~.,data=RSLSWO_train,type="C-classification",
kernel="radial",probability=T,scale=F,gamma=g,
class.weights=w*prop/table(data.train2$outcome))
pred=predict(model.svm.radial,data.train2[-fold,],probability=F)
cm=confusionMatrix(pred,data.train2[-fold,]$outcome,positive="failure")
meas.matrix[1,gg,ww]=meas.matrix[1,gg,ww]+cm$overall[1];
meas.matrix[4,gg,ww]=meas.matrix[4,gg,ww]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww]+1;
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww]+1;
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww];
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww];
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww];
}
else{
meas.matrix[2,gg,ww]=meas.matrix[2,gg,ww]+cm$byClass[5];
meas.matrix[3,gg,ww]=meas.matrix[3,gg,ww]+cm$byClass[1];
meas.matrix[5,gg,ww]=meas.matrix[5,gg,ww]+cm$byClass[7];
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
g.radial.RSLS=NA
w.radial.RSLS=NA
f1max.radial.RSLS=0
for(g in 1:length(gamma)){
for(w in 1:length(weights)){
f1=meas.matrix[5,g,w]
if(f1>f1max.radial.RSLS){
f1max.radial.RSLS=f1
g.radial.RSLS=g
w.radial.RSLS=w
}
}
}
f1max.radial.RSLS
g.radial.RSLS
w.radial.RSLS
degree=seq(2,8,1)
c0=c(-10,-5,0,5,10)
prop=c(sum(data.train2$outcome=="failure"),sum(data.train2$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
meas.matrix<-array(0,dim=c(5,length(degree),length(weights),length(c0)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),degree,weights,c0))
set.seed(3)
gg=0
folds<-create_folds(data.train2$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
SMOTE_model<-smotefamily::RSLS(data.train2[fold,1:18],data.train2[fold,19],K=5,C=5,dupSize=0)
RSLSWO_train=SMOTE_model$data
colnames(RSLSWO_train)[19]="outcome"
RSLSWO_train$outcome=as.factor(RSLSWO_train$outcome)
levels(RSLSWO_train$outcome)=c("failure","success")
gg=0
for(g in degree){
gg=gg+1
jj=0
for(j in c0){
jj=jj+1
ww=0
for(w in weights){
ww=ww+1
model.svm.poly=svm(outcome~.,data=RSLSWO_train,type="C-classification",
kernel="polynomial",probability=T,scale=F,degree=g,coef0=j,
class.weights=w*prop/table(data.train2$outcome))
pred=predict(model.svm.poly,data.train2[-fold,],probability=F)
cm=confusionMatrix(pred,data.train2[-fold,]$outcome,positive="failure")
meas.matrix[1,gg,ww,jj]=meas.matrix[1,gg,ww,jj]+cm$overall[1];
meas.matrix[4,gg,ww,jj]=meas.matrix[4,gg,ww,jj]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+1;
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+1;
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj];
}
else{
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+cm$byClass[5];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+cm$byClass[1];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
g.polynomial.RSLS=NA
w.polynomial.RSLS=NA
c0.polynomial.RSLS=NA
f1max.polynomial.RSLS=0
for(g in 1:length(degree)){
for(j in 1:length(c0)){
for(w in 1:length(weights)){
f1=meas.matrix[5,g,w,j]
if(f1>f1max.polynomial.RSLS){
f1max.polynomial.RSLS=f1
g.polynomial.RSLS=g
w.polynomial.RSLS=w
c0.polynomial.RSLS=j
}
}
}
}
f1max.polynomial.RSLS
w.polynomial.RSLS
g.polynomial.RSLS
c0.polynomial.RSLS
gamma = seq(0.5,4,0.5)
c0= c(2,5,10)
prop=c(sum(data.train2$outcome=="failure"),sum(data.train2$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
meas.matrix<-array(0,dim=c(5,length(gamma),length(weights),length(c0)),
dimnames=list(c("Accuracy","Precision","Sensitivity",
"Specificity","F1"),gamma,weights,c0))
mm=matrix(0,5,nfolds*repeats)
set.seed(3)
gg=0
folds<-create_folds(data.train2$outcome,k=nfolds,m_rep=repeats)
for(fold in folds){
SMOTE_model<-smotefamily::RSLS(data.train2[fold,1:18],data.train2[fold,19],K=5,C=5,dupSize=0)
RSLSWO_train=SMOTE_model$data
colnames(RSLSWO_train)[19]="outcome"
RSLSWO_train$outcome=as.factor(RSLSWO_train$outcome)
levels(RSLSWO_train$outcome)=c("failure","success")
gg=0
for(g in gamma){
gg=gg+1
jj=0
for(j in c0){
jj=jj+1
ww=0
for(w in weights){
ww=ww+1
model.svm.sigmoid=svm(outcome~.,data=RSLSWO_train,type="C-classification",
kernel="sigmoid",probability=T,scale=F,gamma=g,coef0=j,
class.weights=w*prop/table(data.train2$outcome))
pred=predict(model.svm.sigmoid,data.train2[-fold,],probability=F)
cm=confusionMatrix(pred,data.train2[-fold,]$outcome,positive="failure")
meas.matrix[1,gg,ww,jj]=meas.matrix[1,gg,ww,jj]+cm$overall[1];
meas.matrix[4,gg,ww,jj]=meas.matrix[4,gg,ww,jj]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+1;
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+1;
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj];
}
else{
meas.matrix[2,gg,ww,jj]=meas.matrix[2,gg,ww,jj]+cm$byClass[5];
meas.matrix[3,gg,ww,jj]=meas.matrix[3,gg,ww,jj]+cm$byClass[1];
meas.matrix[5,gg,ww,jj]=meas.matrix[5,gg,ww,jj]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
g.sigmoid.RSLS=NA
w.sigmoid.RSLS=NA
c0.sigmoid.RSLS=NA
f1max.sigmoid.RSLS=0
for(g in 1:length(gamma)){
for(j in 1:length(c0)){
for(w in 1:length(weights)){
f1=meas.matrix[5,g,w,j]
if(f1>f1max.sigmoid.RSLS){
f1max.sigmoid.RSLS=f1
g.sigmoid.RSLS=g
w.sigmoid.RSLS=w
c0.sigmoid.RSLS=j
}
}
}
}
f1max.sigmoid.RSLS
w.sigmoid.RSLS
g.sigmoid.RSLS
c0.sigmoid.RSLS
# Gaussian
f1max.radial.RSLS=0.7303333
g.radial.RSLS=2
w.radial.RSLS=3
# Poly
g.polynomial.RSLS=6
w.polynomial.RSLS=1
c0.polynomial.RSLS=5
f1max.polynomial.RSLS= 0.7191667
# Sigmoid
g.sigmoid.RSLS=1
w.sigmoid.RSLS=2
c0.sigmoid.RSLS=1
f1max.sigmoid.RSLS=0.1237311
gamma=2^seq(-2.5,1,.5)
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
cat('Best kernel: Gaussian, best gamma:',gamma[g.radial.RSLS],', best weight for positive class:',weights[[w.radial.RSLS]][1],', best threshold:', weights[[w.radial.RSLS]][1]/(weights[[w.radial.RSLS]][1]+weights[[w.radial.RSLS]][2]),'for RSLS without outcasts.')
## Best kernel: Gaussian, best gamma: 0.25 , best weight for positive class: 3 , best threshold: 0.75 for RSLS without outcasts.
gamma=2^seq(-4,0,0.5)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
model.svm.radial=svm(outcome~.,data=data.train,type="C-classification",
kernel="radial",probability=T,scale=F,gamma=gamma[g.radial],
class.weights=weights[[w.radial]]*prop/table(data.train$outcome))
pred.prob.svm.radial=predict(model.svm.radial,data.test,probability=F)
cm=confusionMatrix(pred.prob.svm.radial,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[8,1]=cm$byClass[[7]]
final.meas[8,2]=cm$byClass[[1]]
final.meas[8,3]=cm$byClass[[5]]
gamma=2^seq(-4,0,.5)
prop=c(sum(data.train$outcome=="failure"),sum(data.train$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1))
model.svm.radial.SMOTE=svm(outcome~.,data=SMOTEtrain,type="C-classification",
kernel="radial",probability=T,scale=F,gamma=gamma[g.radial.SMOTE],
class.weights=weights[[w.radial.SMOTE]]*prop/table(data.train$outcome))
pred.prob.svm.radial.SMOTE=predict(model.svm.radial.SMOTE,data.test,probability=F)
cm=confusionMatrix(pred.prob.svm.radial.SMOTE,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[9,1]=cm$byClass[[7]]
final.meas[9,2]=cm$byClass[[1]]
final.meas[9,3]=cm$byClass[[5]]
gamma=2^seq(-2.5,1,.5)
prop=c(sum(data.train2$outcome=="failure"),sum(data.train2$outcome=="success"))
weights=list(c(1,1),c(2,1),c(3,1),c(4,1),c(5,1),c(6,1),c(7,1),c(8,1),c(9,1))
RSLSWOtrain$outcome=as.factor(RSLSWOtrain$outcome)
model.svm.radial.RSLS=svm(outcome~.,data=RSLSWOtrain,type="C-classification",
kernel="radial",probability=T,scale=F,gamma=gamma[g.radial.RSLS],
class.weights=weights[[w.radial.RSLS]]*prop/table(data.train2$outcome))
pred.prob.svm.radial.RSLS=predict(model.svm.radial.RSLS,data.test,probability=F)
cm=confusionMatrix(pred.prob.svm.radial.RSLS,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
pred.knn1=knn3Train(train=negOutTrain[,c(1,2,13,14)],test=data.test[,c(1,2,13,14)],cl=negOutTrain$outcome,k=1)
pred.prob.svm.radial.RSLS[pred.knn1=="failure"]="failure"
cm=confusionMatrix(pred.prob.svm.radial.RSLS,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[10,1]=cm$byClass[[7]]
final.meas[10,2]=cm$byClass[[1]]
final.meas[10,3]=cm$byClass[[5]]
Figure 2.12: Confusion matrices for the SVM with Gaussian kernel trained on the original training set (top left), on the SMOTE training set (top right), on the RSLS training set without outcasts (bottom left) and on the RSLS training set with outcast handling (bottom right).
Random Forest is a tree-based method for regression and classification that combines a large number of trees in order to achieve better prediction accuracy, at the expense of some loss in interpretability.
In the classification setting, two main steps must be followed to build a simple decision tree:
dividing the predictor space, that is the set of possible values for \(X_1,\dots,X_d\), into \(J\) distinct and non-overlapping rectangular regions \(R_1,\dots,R_J\);
making the same prediction for every observation that falls into region \(R_j\), namely the most commonly occurring class among the training observations in that region.
The goal is to find regions \(R_1,\dots,R_J\) that minimize the classification error rate, i.e. the fraction of the training observations in a region that do not belong to the most common class, \[ E=1-\max_k\hat{p}_{jk}, \] where \(\hat{p}_{jk}\) denotes the proportion of training observations in the \(j\)th region that belong to the \(k\)th class. Since it is computationally unfeasible to consider every possible partition of the feature space into \(J\) rectangular regions, we take a top-down, greedy approach known as recursive binary splitting. At each step, we choose the predictor \(X_j\) and the cutpoint \(s\) for which splitting the predictor space into the regions \[ \{X|X_j<s\}\text{ and } \{X|X_j \geq s\} \] leads to the greatest possible reduction in classification error. We then repeat the process, looking for the best predictor-cutpoint pair to split the data further so as to minimize the classification error within each of the resulting regions, until a stopping criterion is reached: for example, we may continue until no region contains more than five observations. However, this process may overfit the data. Therefore, after growing a very large tree \(T_0\), cost complexity pruning can be applied to find the best trade-off between the complexity of a subtree and its fit to the training data.
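The recursive binary splitting step described above can be sketched as an exhaustive scan over predictor-cutpoint pairs. This is a minimal illustration on toy data; the `best_split` helper and the simulated variables are assumptions made for the example, not part of the report's pipeline.

```r
# One greedy split: try every predictor j and cutpoint s, and keep the pair
# whose two half-spaces minimize the weighted training classification error.
best_split <- function(X, y) {
  err <- function(labels) {
    if (length(labels) == 0) return(0)
    mean(labels != names(which.max(table(labels))))  # E = 1 - max_k p_hat_k
  }
  best <- list(error = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))) {
      left  <- y[X[, j] <  s]
      right <- y[X[, j] >= s]
      e <- (length(left) * err(left) + length(right) * err(right)) / length(y)
      if (e < best$error) best <- list(j = j, s = s, error = e)
    }
  }
  best
}

set.seed(1)
X <- data.frame(x1 = runif(20), x2 = runif(20))
y <- ifelse(X$x1 < 0.5, "failure", "success")  # outcome driven by x1 only
best_split(X, y)  # picks a cutpoint on x1 that separates the classes exactly
```

Growing a full tree simply applies the same search recursively inside each resulting region until the stopping criterion is met.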
However, it turns out that the classification error is not sufficiently sensitive for tree-growing, and the Gini index is a much better measure to use in this phase. The Gini index is defined as \[ G=\sum_{k=1}^K \hat{p}_{jk}(1-\hat{p}_{jk}). \]
It takes small values when all the \(\hat{p}_{jk}\) are close to 0 or 1. For this reason it is considered a measure of node purity: a small value indicates that the node contains predominantly observations from a single class. Nevertheless, the classification error rate is preferable for pruning when prediction accuracy is the goal.
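As a quick numerical check of the two impurity measures, consider a hypothetical two-class node: the Gini index distinguishes a fairly pure node from a maximally mixed one, exactly as the formulas above prescribe.

```r
# Impurity of a node given its vector of class proportions p_hat
gini <- function(p) sum(p * (1 - p))      # G = sum_k p_hat_k * (1 - p_hat_k)
class_error <- function(p) 1 - max(p)     # E = 1 - max_k p_hat_k

gini(c(0.9, 0.1))         # 0.18  (fairly pure node)
gini(c(0.5, 0.5))         # 0.5   (maximally mixed node)
class_error(c(0.9, 0.1))  # 0.1
class_error(c(0.5, 0.5))  # 0.5
```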
The decision tree described above suffers from high variance. Quoting [1], “Small perturbations to the values of the adjustable parameters can amplify and lead to large changes in simulation outputs.” Therefore, we decided not to apply a single decision tree, since our data need to be treated with more robust methods.
In order to reduce variance, we can use a procedure called bootstrap aggregation, or bagging, which consists of the following steps:
constructing \(B\) classification trees using \(B\) bootstrapped training sets;
recording the class predicted by each of the \(B\) trees and taking a majority vote: the overall prediction is the most commonly occurring class among the \(B\) predictions.
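The two bagging steps can be sketched in base R. To keep the example self-contained, a deliberately weak one-rule stump stands in for a full classification tree; the toy data and the stump are illustrative assumptions only, not the report's model.

```r
set.seed(1)
n <- 100
x <- runif(n)
y <- factor(ifelse(x < 0.4, "failure", "success"))  # true rule: cut at 0.4

B <- 25
votes <- replicate(B, {
  idx <- sample(n, replace = TRUE)                  # step 1: bootstrap sample
  cut <- median(x[idx])                             # fit a trivial stump on it
  lab_lo <- names(which.max(table(y[idx][x[idx] <  cut])))
  lab_hi <- names(which.max(table(y[idx][x[idx] >= cut])))
  ifelse(x < cut, lab_lo, lab_hi)                   # predict on all n points
})
# step 2: majority vote across the B bootstrap classifiers
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == as.character(y))                     # aggregate accuracy
```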
However, it turns out that bagged trees tend to be highly correlated, because most of them rely on the same strong predictors in the top splits. Random Forest provides an improvement in this respect: for each tree and for each split, a random sample of \(m\) predictors is chosen as split candidates from the full set of \(d\) predictors.
When learning from extremely imbalanced data, there is a significant probability that a bootstrapped sample contains few or even no observations of the minority class, resulting in a tree with poor performance on that class. We try to overcome this problem by passing the “strata” argument to the randomForest function. To train the random forest, we follow the same procedure as for \(k\)-NN, choosing mtry (\(m\)), ntree (\(B\)) and the probability threshold via cross-validation.
library(randomForest)
repeats=10
nfolds=20
threshold=seq(0.1,0.9,by=0.1)
mtry= 1:18
ntree=c(50,100,150,200)
meas.matrix <- array(0,dim=c(5,length(threshold),length(mtry),length(ntree)), dimnames=list(c("Accuracy","Precision","Sensitivity","Specificity","F1"), threshold,mtry,ntree))
# We start by making repeated, stratified cross-validation folds
set.seed(3)
folds <- create_folds(data.train$outcome, k = nfolds, m_rep = repeats)
for(fold in folds){
kk=0
for(m in mtry){
kk=kk+1
nn=0
for(n in ntree){
nn=nn+1
mod = randomForest(outcome ~ ., data = data.train[fold,], ntree = n, mtry = m, strata=data.train[fold,]$outcome)
prob= predict(mod, data.train[-fold,], type="prob")
ss=0
for(s in threshold){
ss=ss+1
newpred = factor(ifelse(prob[,"failure"] > s, "failure", "success"), levels=c("failure","success"))
cm=confusionMatrix(newpred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,ss,kk,nn]=meas.matrix[1,ss,kk,nn]+cm$overall[1]; # Accuracy and Specificity are never NA or NaN
meas.matrix[4,ss,kk,nn]=meas.matrix[4,ss,kk,nn]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn]+1;
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn]+1;
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn];
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn];
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn];
}
else{
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn]+cm$byClass[5];
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn]+cm$byClass[1];
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
## choose best m and threshold
m.rf.original=NA
s.rf.original=NA
n.rf.original=NA
f1max = 0
for(s in 1:length(threshold)){
for(m in 1:length(mtry)){
for(n in 1:length(ntree)){
f1 <- meas.matrix["F1",s,m,n]
if(f1max <f1){
f1max = f1
m.rf.original=m
s.rf.original=s
n.rf.original=n
}
}}
}
m.rf.original
s.rf.original
n.rf.original
threshold=seq(0.1,0.9,by=0.1)
mtry= 1:18
ntree= c(50,100,150,200)
m.rf.original=6
s.rf.original=2
n.rf.original=2
cat('Best m:',mtry[m.rf.original],',best threshold:',threshold[s.rf.original],', best value for ntree:',ntree[n.rf.original],'for original.')
## Best m: 6 ,best threshold: 0.2 , best value for ntree: 100 for original.
repeats=10
nfolds=20
threshold=seq(0.1,0.9,by=0.1)
mtry= 1:18
ntree=c(50,100,150,200)
meas.matrix <- array(0,dim=c(5,length(threshold),length(mtry),length(ntree)),dimnames=list(c("Accuracy","Precision","Sensitivity","Specificity","F1"), threshold,mtry,ntree))
# We start by making repeated, stratified cross-validation folds
set.seed(3)
folds <- create_folds(data.train$outcome, k = nfolds, m_rep = repeats)
for(fold in folds){
SMOTE_train<-smotefamily::SMOTE(data.train[fold,1:18],data.train[fold,19],K=5,dup_size = 0)
SMOTE_train=SMOTE_train$data
colnames(SMOTE_train)[19]="outcome"
SMOTE_train$outcome=as.factor(SMOTE_train$outcome)
levels(SMOTE_train$outcome)=c("failure","success")
kk=0
for(m in mtry){
kk=kk+1
nn=0
for(n in ntree){
nn=nn+1
mod = randomForest(outcome ~ ., data = SMOTE_train, ntree = n, mtry = m, strata=SMOTE_train$outcome)
prob= predict(mod, data.train[-fold,], type="prob")
ss=0
for(s in threshold){
ss=ss+1
newpred = factor(ifelse(prob[,"failure"] > s, "failure", "success"), levels=c("failure","success"))
cm=confusionMatrix(newpred,data.train[-fold,]$outcome,positive="failure")
meas.matrix[1,ss,kk,nn]=meas.matrix[1,ss,kk,nn]+cm$overall[1]; # Accuracy and Specificity are never NA or NaN
meas.matrix[4,ss,kk,nn]=meas.matrix[4,ss,kk,nn]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn]+1;
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn]+1;
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn];
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn];
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn];
}
else{
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn]+cm$byClass[5];
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn]+cm$byClass[1];
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
## choose best m and threshold
m.rf.SMOTE=NA
s.rf.SMOTE=NA
n.rf.SMOTE=NA
f1max = 0
for(s in 1:length(threshold)){
for(m in 1:length(mtry)){
for(n in 1:length(ntree)){
f1 <- meas.matrix["F1",s,m,n]
if(f1max <f1){
f1max = f1
m.rf.SMOTE=m
s.rf.SMOTE=s
n.rf.SMOTE=n
}
}}
}
m.rf.SMOTE
s.rf.SMOTE
n.rf.SMOTE
m.rf.SMOTE=6
s.rf.SMOTE=3
n.rf.SMOTE=3
cat('Best m:',mtry[m.rf.SMOTE],',best threshold:',threshold[s.rf.SMOTE],', best value for ntree:',ntree[n.rf.SMOTE],'for SMOTE.')
## Best m: 6 ,best threshold: 0.3 , best value for ntree: 150 for SMOTE.
repeats=10
nfolds=20
threshold=seq(0.1,0.9,by=0.1)
mtry= 1:18
ntree=c(50,100,150,200)
data.train2=anti_join(data.train,outcast)
meas.matrix <- array(0,dim=c(5,length(threshold),length(mtry),length(ntree)),dimnames=list(c("Accuracy","Precision","Sensitivity","Specificity","F1"), threshold,mtry,ntree))
# We start by making repeated, stratified cross-validation folds
set.seed(3)
folds <- create_folds(data.train2$outcome, k = nfolds, m_rep = repeats)
for(fold in folds){
SMOTE_model<-smotefamily::RSLS(data.train2[fold,-19],data.train2[fold,19],K=5,C=5,dupSize=0)
RSLSWO_train=SMOTE_model$data
colnames(RSLSWO_train)[19]="outcome"
RSLSWO_train$outcome=as.factor(RSLSWO_train$outcome)
levels(RSLSWO_train$outcome)=c("failure","success")
kk=0
for(m in mtry){
kk=kk+1
nn=0
for(n in ntree){
nn=nn+1
mod = randomForest(outcome ~ ., data = RSLSWO_train, ntree = n, mtry = m, strata=RSLSWO_train$outcome)
prob= predict(mod, data.train2[-fold,], type="prob")
ss=0
for(s in threshold){
ss=ss+1
newpred = factor(ifelse(prob[,"failure"] > s, "failure", "success"), levels=c("failure","success"))
cm=confusionMatrix(newpred,data.train2[-fold,]$outcome,positive="failure")
meas.matrix[1,ss,kk,nn]=meas.matrix[1,ss,kk,nn]+cm$overall[1]; # Accuracy and Specificity are never NA or NaN
meas.matrix[4,ss,kk,nn]=meas.matrix[4,ss,kk,nn]+cm$byClass[2];
tt=as.numeric(cm$table)
if(tt[1]==0 & tt[2]==0 & tt[3]==0){
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn]+1;
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn]+1;
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn]+1;
}
else if(tt[1]==0 & (tt[2]!=0 | tt[3]!=0)){
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn];
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn];
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn];
}
else{
meas.matrix[2,ss,kk,nn]=meas.matrix[2,ss,kk,nn]+cm$byClass[5];
meas.matrix[3,ss,kk,nn]=meas.matrix[3,ss,kk,nn]+cm$byClass[1];
meas.matrix[5,ss,kk,nn]=meas.matrix[5,ss,kk,nn]+cm$byClass[7];
}
}
}
}
}
meas.matrix=meas.matrix/(nfolds*repeats)
## choose best m and threshold
m.rf.RSLSWO=NA
s.rf.RSLSWO=NA
n.rf.RSLSWO=NA
f1max = 0
for(s in 1:length(threshold)){
for(m in 1:length(mtry)){
for(n in 1:length(ntree)){
f1 <- meas.matrix["F1",s,m,n]
if(f1max <f1){
f1max = f1
m.rf.RSLSWO=m
s.rf.RSLSWO=s
n.rf.RSLSWO=n
}
}}
}
m.rf.RSLSWO
s.rf.RSLSWO
n.rf.RSLSWO
m.rf.RSLSWO=6
s.rf.RSLSWO=3
n.rf.RSLSWO=4
cat('Best m:',mtry[m.rf.RSLSWO],',best threshold:',threshold[s.rf.RSLSWO],', best value for ntree:',ntree[n.rf.RSLSWO],'for RSLS without outcasts.')
## Best m: 6 ,best threshold: 0.3 , best value for ntree: 200 for RSLS without outcasts.
library(randomForest)
# train model
set.seed(2)
mod.rf.original = randomForest(outcome ~ ., data = data.train, ntree = ntree[n.rf.original], mtry = m.rf.original,
strata=data.train$outcome,importance=TRUE)
prob.rf.original= predict(mod.rf.original, data.test, type="prob")
newpred = factor(ifelse(prob.rf.original[,"failure"] > threshold[s.rf.original], "failure", "success"), levels=c("failure","success"))
cm=confusionMatrix(newpred,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[11,1]=cm$byClass[[7]]
final.meas[11,2]=cm$byClass[[1]]
final.meas[11,3]=cm$byClass[[5]]
# train model
mod.rf.SMOTE = randomForest(outcome ~ ., data = SMOTEtrain, ntree = ntree[n.rf.SMOTE], mtry = m.rf.SMOTE,
strata=SMOTEtrain$outcome,importance=TRUE)
prob.rf.SMOTE= predict(mod.rf.SMOTE, data.test, type="prob")
newpred = factor(ifelse(prob.rf.SMOTE[,"failure"] > threshold[s.rf.SMOTE], "failure", "success"), levels=c("failure","success"))
cm=confusionMatrix(newpred,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[12,1]=cm$byClass[[7]]
final.meas[12,2]=cm$byClass[[1]]
final.meas[12,3]=cm$byClass[[5]]
# train model
mod.rf.RSLS = randomForest(outcome ~ ., data = RSLSWOtrain, ntree = ntree[n.rf.RSLSWO],
mtry=m.rf.RSLSWO,strata=RSLSWOtrain$outcome,importance=TRUE)
prob.rf.RSLS= predict(mod.rf.RSLS, data.test, type="prob")
newpred = factor(ifelse(prob.rf.RSLS[,"failure"] > threshold[s.rf.RSLSWO], "failure", "success"), levels=c("failure","success"))
cm=confusionMatrix(newpred,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
pred.knn1=knn3Train(train=negOutTrain[,c(1,2,13,14)],test=data.test[,c(1,2,13,14)],cl=negOutTrain$outcome,k=1)
prob.rf.RSLSho=prob.rf.RSLS
prob.rf.RSLSho[pred.knn1=="failure",]=prob.acc[pred.knn1=="failure",]
newpred[pred.knn1=="failure"]="failure"
cm=confusionMatrix(newpred,data.test$outcome,positive="failure")
draw_confusion_matrix(cm)
final.meas[13,1]=cm$byClass[[7]]
final.meas[13,2]=cm$byClass[[1]]
final.meas[13,3]=cm$byClass[[5]]
Figure 2.13: Confusion matrices for Random Forest trained on the original training set (top left), the SMOTE training set (top right), the RSLS training set without outcasts (bottom left) and the RSLS training set with outcast handling (bottom right).
Random Forest allows us to obtain a measure of the importance of each predictor by quantifying the total decrease in the Gini index or in accuracy due to splits on that predictor, averaged over all bagged trees. A large value of this quantity indicates that the predictor plays a crucial role in the classification procedure.
Notice that all the importance values are rescaled from \(0\) to \(100\). The plots below show that in the rebalanced data sets the most important predictors correspond to the ones we already detected by looking at figure 1.3. However, this does not hold for the original training set: even if the top \(2\) predictors (vconst_corr and vconst_2) remain the same, other predictors seem to become relevant for the classification of failure instances.
rescale <- function(x) (x-min(x))/(max(x) - min(x)) * 100
imp.original.rf <- importance(mod.rf.original) # per-class importance, mean decrease in accuracy and mean decrease in Gini index
df=data.frame(imp.original.rf)
df=data.frame(rownames(df),df)
fig <- plot_ly(type = 'bar')
df1=df[order(df$MeanDecreaseGini,decreasing=FALSE),]
fig <- fig%>%add_trace(x=rescale(df1$MeanDecreaseGini),y=df1$rownames.df., visible=T)
df2=df[order(df$MeanDecreaseAccuracy,decreasing=FALSE),]
Importance = rescale(df2$MeanDecreaseAccuracy)
fig <- fig%>%add_trace(x=rescale(df2$MeanDecreaseAccuracy),y=df2$rownames.df.,visible=F)
df3=df[order(df$failure,decreasing=FALSE),]
Importance = rescale(df3$failure)
fig <- fig%>% add_trace(x=rescale(df3$failure),y=df3$rownames.df.,visible=F)
df4=df[order(df$success,decreasing=FALSE),]
Importance = rescale(df4$success)
fig <- fig%>% add_trace(x=rescale(df4$success),y=df4$rownames.df.,visible=F)
fig <- fig %>% layout(
title = "Importance plot - original",
xaxis = list(domain = c(0.3, 1)),
barmode='stack',
updatemenus = list(
list(
y = 0.8,
active = 0,
buttons = list(
list(method = "update",
args = list( list(visible=c(F,T,F,F,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df1$rownames.df.))),
label = "MeanDecreaseGini",
title = "MeanDecreaseGini"
),
list(method = "update",
args = list( list(visible=c(F,F,T,F,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df2$rownames.df.))),
label = "MeanDecreaseAccuracy",
title = "MeanDecreaseAccuracy"
),
list(method = "update",
args = list(list(visible=c(F,F,F,T,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df3$rownames.df.))),
label = "failure",
title="failure"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,T)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df4$rownames.df.))),
label = "success",
title="success"
)
)
)
),
showlegend=FALSE
)
fig
Figure 2.14: Importance plot in the original training set.
imp.SMOTE.rf <- importance(mod.rf.SMOTE) # per-class importance, mean decrease in accuracy and mean decrease in Gini index
df=data.frame(imp.SMOTE.rf)
df=data.frame(rownames(df),df)
fig <- plot_ly(type = 'bar')
df1=df[order(df$MeanDecreaseGini,decreasing=FALSE),]
fig <- fig%>%add_trace(x=rescale(df1$MeanDecreaseGini),y=df1$rownames.df., visible=T)
df2=df[order(df$MeanDecreaseAccuracy,decreasing=FALSE),]
fig <- fig%>%add_trace(x=rescale(df2$MeanDecreaseAccuracy),y=df2$rownames.df.,visible=F)
df3=df[order(df$failure,decreasing=FALSE),]
fig <- fig%>% add_trace(x=rescale(df3$failure),y=df3$rownames.df.,visible=F)
df4=df[order(df$success,decreasing=FALSE),]
fig <- fig%>% add_trace(x=rescale(df4$success),y=df4$rownames.df.,visible=F)
fig <- fig %>% layout(
title = "Importance plot - SMOTE",
xaxis = list(domain = c(0.3, 1)),
barmode='stack',
updatemenus = list(
list(
y = 0.8,
active = 0,
buttons = list(
list(method = "update",
args = list( list(visible=c(F,T,F,F,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df1$rownames.df.))),
label = "MeanDecreaseGini",
title = "MeanDecreaseGini"
),
list(method = "update",
args = list( list(visible=c(F,F,T,F,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df2$rownames.df.))),
label = "MeanDecreaseAccuracy",
title = "MeanDecreaseAccuracy"
),
list(method = "update",
args = list(list(visible=c(F,F,F,T,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df3$rownames.df.))),
label = "failure",
title="failure"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,T)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df4$rownames.df.))),
label = "success",
title="success"
)
)
)
),
showlegend=FALSE
)
fig
Figure 2.15: Importance plot in the SMOTE training set.
imp.RSLSWO.rf <- importance(mod.rf.RSLS) # per-class importance, mean decrease in accuracy and mean decrease in Gini index
df=data.frame(imp.RSLSWO.rf)
df=data.frame(rownames(df),df)
fig <- plot_ly(type = 'bar')
df1=df[order(df$MeanDecreaseGini,decreasing=FALSE),]
fig <- fig%>%add_trace(x=rescale(df1$MeanDecreaseGini),y=df1$rownames.df., visible=T)
df2=df[order(df$MeanDecreaseAccuracy,decreasing=FALSE),]
fig <- fig%>%add_trace(x=rescale(df2$MeanDecreaseAccuracy),y=df2$rownames.df.,visible=F)
df3=df[order(df$failure,decreasing=FALSE),]
fig <- fig%>% add_trace(x=rescale(df3$failure),y=df3$rownames.df.,visible=F)
df4=df[order(df$success,decreasing=FALSE),]
fig <- fig%>% add_trace(x=rescale(df4$success),y=df4$rownames.df.,visible=F)
fig <- fig %>% layout(
title = "Importance plot - RSLS w/o outcast",
xaxis = list(domain = c(0.3, 1)),
barmode='stack',
updatemenus = list(
list(
y = 0.8,
active = 0,
buttons = list(
list(method = "update",
args = list( list(visible=c(F,T,F,F,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df1$rownames.df.))),
label = "MeanDecreaseGini",
title = "MeanDecreaseGini"
),
list(method = "update",
args = list( list(visible=c(F,F,T,F,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df2$rownames.df.))),
label = "MeanDecreaseAccuracy",
title = "MeanDecreaseAccuracy"
),
list(method = "update",
args = list(list(visible=c(F,F,F,T,F)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df3$rownames.df.))),
label = "failure",
title="failure"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,T)), list(yaxis = list('categoryorder'='array', 'categoryarray'=df4$rownames.df.))),
label = "success",
title="success"
)
)
)
),
showlegend=FALSE
)
fig
Figure 2.16: Importance plot in the RSLS without outcasts training set.
We now display the ROC and precision-recall (PR) curves of our models.
pred.prob.logistic=1-predict(regLogStepOriginal,data.test,type="response")
pred.prob.knn=prob.knn.original[,1]
pred.prob.knn.SMOTE=prob.knn.SMOTE[,1]
#pred.prob.knn.RSLS=prob.knn.RSLS[,1]
pred.prob.knn.RSLSho=prob.knn.RSLSho[,1]
pred.prob.svm.linear=predict(model.svm.linear,data.test,probability=T)
pred.prob.svm.linear.SMOTE=predict(model.svm.linear.SMOTE,data.test,probability=T)
pred.prob.svm.linear.RSLS=predict(model.svm.linear.RSLS,data.test,probability=T)
pred.prob.svm.linear.RSLSho=pred.prob.svm.linear.RSLS
pred.prob.svm.linear.RSLSho=attr(pred.prob.svm.linear.RSLSho,"prob")
pred.prob.svm.linear.RSLSho[pred.knn1=="failure",]=prob.acc[pred.knn1=="failure",]
pred.prob.svm.kernel=predict(model.svm.radial,data.test,probability=T)
pred.prob.svm.kernel.SMOTE=predict(model.svm.radial.SMOTE,data.test,probability=T)
pred.prob.svm.kernel.RSLS=predict(model.svm.radial.RSLS,data.test,probability=T)
pred.prob.svm.kernel.RSLSho=pred.prob.svm.kernel.RSLS
pred.prob.svm.kernel.RSLSho=attr(pred.prob.svm.kernel.RSLSho,"prob")
pred.prob.svm.kernel.RSLSho[pred.knn1=="failure",]=prob.acc[pred.knn1=="failure",]
pred.prob.rf=prob.rf.original[,"failure"]
pred.prob.rf.SMOTE=prob.rf.SMOTE[,"failure"]
#pred.prob.rf.RSLS=prob.rf.RSLS[,"failure"]
pred.prob.rf.RSLSho=prob.rf.RSLSho[,"failure"]
pred.logistic=prediction(pred.prob.logistic,ifelse(data.test$outcome=="failure",1,0))
pred.knn=prediction(pred.prob.knn,ifelse(data.test$outcome=="failure",1,0))
pred.knn.SMOTE=prediction(pred.prob.knn.SMOTE,ifelse(data.test$outcome=="failure",1,0))
#pred.knn.RSLS=prediction(pred.prob.knn.RSLS,ifelse(data.test$outcome=="failure",1,0))
pred.knn.RSLSho=prediction(pred.prob.knn.RSLSho,ifelse(data.test$outcome=="failure",1,0))
pred.svm.linear=prediction(attr(pred.prob.svm.linear,"probabilities")[,1],
ifelse(data.test$outcome=="failure",1,0))
pred.svm.linear.SMOTE=prediction(attr(pred.prob.svm.linear.SMOTE,"probabilities")[,1],
ifelse(data.test$outcome=="failure",1,0))
#pred.svm.linear.RSLS=prediction(attr(pred.prob.svm.linear.RSLS,"probabilities")[,1],
# ifelse(data.test$outcome=="failure",1,0))
pred.svm.linear.RSLSho=prediction(pred.prob.svm.linear.RSLSho[,1],
ifelse(data.test$outcome=="failure",1,0))
pred.svm.kernel=prediction(attr(pred.prob.svm.kernel,"probabilities")[,1],
ifelse(data.test$outcome=="failure",1,0))
pred.svm.kernel.SMOTE=prediction(attr(pred.prob.svm.kernel.SMOTE,"probabilities")[,1],
ifelse(data.test$outcome=="failure",1,0))
#pred.svm.kernel.RSLS=prediction(attr(pred.prob.svm.kernel.RSLS,"probabilities")[,1],
# ifelse(data.test$outcome=="failure",1,0))
pred.svm.kernel.RSLSho=prediction(pred.prob.svm.kernel.RSLSho[,1],
ifelse(data.test$outcome=="failure",1,0))
pred.rf=prediction(pred.prob.rf,ifelse(data.test$outcome=="failure",1,0))
pred.rf.SMOTE=prediction(pred.prob.rf.SMOTE,ifelse(data.test$outcome=="failure",1,0))
#pred.rf.RSLS=prediction(pred.prob.rf.RSLS,ifelse(data.test$outcome=="failure",1,0))
pred.rf.RSLSho=prediction(pred.prob.rf.RSLSho,ifelse(data.test$outcome=="failure",1,0))
perf.logistic=performance(pred.logistic,"tpr","fpr")
perf.knn=performance(pred.knn,"tpr","fpr")
perf.knn.SMOTE=performance(pred.knn.SMOTE,"tpr","fpr")
#perf.knn.RSLS=performance(pred.knn.RSLS,"tpr","fpr")
perf.knn.RSLSho=performance(pred.knn.RSLSho,"tpr","fpr")
perf.svm.linear=performance(pred.svm.linear,"tpr","fpr")
perf.svm.linear.SMOTE=performance(pred.svm.linear.SMOTE,"tpr","fpr")
#perf.svm.linear.RSLS=performance(pred.svm.linear.RSLS,"tpr","fpr")
perf.svm.linear.RSLSho=performance(pred.svm.linear.RSLSho,"tpr","fpr")
perf.svm.kernel=performance(pred.svm.kernel,"tpr","fpr")
perf.svm.kernel.SMOTE=performance(pred.svm.kernel.SMOTE,"tpr","fpr")
#perf.svm.kernel.RSLS=performance(pred.svm.kernel.RSLS,"tpr","fpr")
perf.svm.kernel.RSLSho=performance(pred.svm.kernel.RSLSho,"tpr","fpr")
perf.rf=performance(pred.rf,"tpr","fpr")
perf.rf.SMOTE=performance(pred.rf.SMOTE,"tpr","fpr")
#perf.rf.RSLS=performance(pred.rf.RSLS,"tpr","fpr")
perf.rf.RSLSho=performance(pred.rf.RSLSho,"tpr","fpr")
fig <- plot_ly(type = "scatter",mode="lines") %>%
add_trace(x=~perf.logistic@x.values[[1]], y = ~perf.logistic@y.values[[1]], type = "scatter",mode="lines",name="Logistic regression",visible=T) %>%
add_trace(x=~perf.knn@x.values[[1]], y = ~perf.knn@y.values[[1]], type = "scatter",mode="lines",name="k-NN - original",visible=T) %>%
add_trace(x=~perf.knn.SMOTE@x.values[[1]], y = ~perf.knn.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="k-NN - SMOTE",visible=T) %>%
add_trace(x=~perf.knn.RSLSho@x.values[[1]], y = ~perf.knn.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="k-NN - RSLSho",visible=T) %>%
add_trace(x=~perf.svm.linear@x.values[[1]], y = ~perf.svm.linear@y.values[[1]], type = "scatter",mode="lines",name="linear SVM - original",visible=T) %>%
add_trace(x=~perf.svm.linear.SMOTE@x.values[[1]], y = ~perf.svm.linear.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="linear SVM - SMOTE",visible=T) %>%
add_trace(x=~perf.svm.linear.RSLSho@x.values[[1]], y = ~perf.svm.linear.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="linear SVM - RSLSho",visible=T) %>%
add_trace(x=~perf.svm.kernel@x.values[[1]], y = ~perf.svm.kernel@y.values[[1]], type = "scatter",mode="lines",name="kernel SVM - original",visible=T) %>%
add_trace(x=~perf.svm.kernel.SMOTE@x.values[[1]], y = ~perf.svm.kernel.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="kernel SVM - SMOTE",visible=T) %>%
add_trace(x=~perf.svm.kernel.RSLSho@x.values[[1]], y = ~perf.svm.kernel.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="kernel SVM - RSLSho",visible=T) %>%
add_trace(x=~perf.rf@x.values[[1]], y = ~perf.rf@y.values[[1]], type = "scatter",mode="lines",name="Random forest - original",visible=T) %>%
add_trace(x=~perf.rf.SMOTE@x.values[[1]], y = ~perf.rf.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="Random forest - SMOTE",visible=T)%>%
add_trace(x=~perf.rf.RSLSho@x.values[[1]], y = ~perf.rf.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="Random forest - RSLSho",visible=T)
fig <- fig %>% layout(
title = "ROC",
xaxis = list(title="1-Specificity",domain = c(0.1, 1)),
yaxis = list(title = "Sensitivity"),
updatemenus = list(
list(
y = 0.8,
active = 0,
buttons = list(
list(method = "update",
args = list( list(visible=c(F,T,T,T,T,T,T,T,T,T,T,T,T,T))),
label = "All",
title = "All"
),
list(method = "update",
args = list( list(visible=c(F,T,F,F,F,F,F,F,F,F,F,F,F,F))),
label = "Logistic regression",
title = "Logistic regression"
),
list(method = "update",
args = list( list(visible=c(F,F,T,F,F,F,F,F,F,F,F,F,F,F))),
label = "k-NN - original",
title = "k-NN - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,T,F,F,F,F,F,F,F,F,F,F))),
label = "k-NN - SMOTE",
title="k-NN - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,T,F,F,F,F,F,F,F,F,F))),
label = "k-NN - RSLSho",
title = "k-NN - RSLSho"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,T,F,F,F,F,F,F,F,F))),
label = "linear SVM - original",
title = "linear SVM - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,F,F,T,F,F,F,F,F,F,F))),
label = "linear SVM - SMOTE",
title="linear SVM - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,T,F,F,F,F,F,F))),
label = "linear SVM - RSLSho",
title = "linear SVM - RSLSho"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,T,F,F,F,F,F))),
label = "kernel SVM - original",
title = "kernel SVM - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,F,F,F,F,F,T,F,F,F,F))),
label = "kernel SVM - SMOTE",
title="kernel SVM - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,F,F,T,F,F,F))),
label = "kernel SVM - RSLSho",
title = "kernel SVM - RSLSho"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,F,F,F,T,F,F))),
label = "Random forest - original",
title = "Random forest - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,F,F,F,F,F,F,F,F,T,F))),
label = "Random forest - SMOTE",
title="Random forest - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,F,F,F,F,F,T))),
label = "Random forest - RSLSho",
title = "Random forest - RSLSho"
)
)
)
))
fig
Figure 2.17: ROC curves.
perf.logistic=performance(pred.logistic,"prec","rec")
perf.knn=performance(pred.knn,"prec","rec")
perf.knn.SMOTE=performance(pred.knn.SMOTE,"prec","rec")
#perf.knn.RSLS=performance(pred.knn.RSLS,"prec","rec")
perf.knn.RSLSho=performance(pred.knn.RSLSho,"prec","rec")
perf.svm.linear=performance(pred.svm.linear,"prec","rec")
perf.svm.linear.SMOTE=performance(pred.svm.linear.SMOTE,"prec","rec")
#perf.svm.linear.RSLS=performance(pred.svm.linear.RSLS,"prec","rec")
perf.svm.linear.RSLSho=performance(pred.svm.linear.RSLSho,"prec","rec")
perf.svm.kernel=performance(pred.svm.kernel,"prec","rec")
perf.svm.kernel.SMOTE=performance(pred.svm.kernel.SMOTE,"prec","rec")
#perf.svm.kernel.RSLS=performance(pred.svm.kernel.RSLS,"prec","rec")
perf.svm.kernel.RSLSho=performance(pred.svm.kernel.RSLSho,"prec","rec")
perf.rf=performance(pred.rf,"prec","rec")
perf.rf.SMOTE=performance(pred.rf.SMOTE,"prec","rec")
#perf.rf.RSLS=performance(pred.rf.RSLS,"prec","rec")
perf.rf.RSLSho=performance(pred.rf.RSLSho,"prec","rec")
fig <- plot_ly(type = "scatter",mode="lines") %>%
add_trace(x=~perf.logistic@x.values[[1]], y = ~perf.logistic@y.values[[1]], type = "scatter",mode="lines",name="Logistic regression",visible=T) %>%
add_trace(x=~perf.knn@x.values[[1]], y = ~perf.knn@y.values[[1]], type = "scatter",mode="lines",name="k-NN - original",visible=T) %>%
add_trace(x=~perf.knn.SMOTE@x.values[[1]], y = ~perf.knn.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="k-NN - SMOTE",visible=T) %>%
add_trace(x=~perf.knn.RSLSho@x.values[[1]], y = ~perf.knn.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="k-NN - RSLSho",visible=T) %>%
add_trace(x=~perf.svm.linear@x.values[[1]], y = ~perf.svm.linear@y.values[[1]], type = "scatter",mode="lines",name="linear SVM - original",visible=T) %>%
add_trace(x=~perf.svm.linear.SMOTE@x.values[[1]], y = ~perf.svm.linear.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="linear SVM - SMOTE",visible=T) %>%
add_trace(x=~perf.svm.linear.RSLSho@x.values[[1]], y = ~perf.svm.linear.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="linear SVM - RSLSho",visible=T) %>%
add_trace(x=~perf.svm.kernel@x.values[[1]], y = ~perf.svm.kernel@y.values[[1]], type = "scatter",mode="lines",name="kernel SVM - original",visible=T) %>%
add_trace(x=~perf.svm.kernel.SMOTE@x.values[[1]], y = ~perf.svm.kernel.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="kernel SVM - SMOTE",visible=T) %>%
add_trace(x=~perf.svm.kernel.RSLSho@x.values[[1]], y = ~perf.svm.kernel.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="kernel SVM - RSLSho",visible=T) %>%
add_trace(x=~perf.rf@x.values[[1]], y = ~perf.rf@y.values[[1]], type = "scatter",mode="lines",name="Random forest - original",visible=T) %>%
add_trace(x=~perf.rf.SMOTE@x.values[[1]], y = ~perf.rf.SMOTE@y.values[[1]], type = "scatter",mode="lines",name="Random forest - SMOTE",visible=T)%>%
add_trace(x=~perf.rf.RSLSho@x.values[[1]], y = ~perf.rf.RSLSho@y.values[[1]], type = "scatter",mode="lines",name="Random forest - RSLSho",visible=T)
fig <- fig %>% layout(
title = "PRC",
xaxis = list(title="Recall",domain = c(0.1, 1)),
yaxis = list(title = "Precision"),
updatemenus = list(
list(
y = 0.8,
active = 0,
buttons = list(
list(method = "update",
args = list( list(visible=c(F,T,T,T,T,T,T,T,T,T,T,T,T,T))),
label = "All",
title = "All"
),
list(method = "update",
args = list( list(visible=c(F,T,F,F,F,F,F,F,F,F,F,F,F,F))),
label = "Logistic regression",
title = "Logistic regression"
),
list(method = "update",
args = list( list(visible=c(F,F,T,F,F,F,F,F,F,F,F,F,F,F))),
label = "k-NN - original",
title = "k-NN - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,T,F,F,F,F,F,F,F,F,F,F))),
label = "k-NN - SMOTE",
title="k-NN - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,T,F,F,F,F,F,F,F,F,F))),
label = "k-NN - RSLSho",
title = "k-NN - RSLSho"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,T,F,F,F,F,F,F,F,F))),
label = "linear SVM - original",
title = "linear SVM - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,F,F,T,F,F,F,F,F,F,F))),
label = "linear SVM - SMOTE",
title="linear SVM - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,T,F,F,F,F,F,F))),
label = "linear SVM - RSLSho",
title = "linear SVM - RSLSho"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,T,F,F,F,F,F))),
label = "kernel SVM - original",
title = "kernel SVM - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,F,F,F,F,F,T,F,F,F,F))),
label = "kernel SVM - SMOTE",
title="kernel SVM - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,F,F,T,F,F,F))),
label = "kernel SVM - RSLSho",
title = "kernel SVM - RSLSho"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,F,F,F,T,F,F))),
label = "Random forest - original",
title = "Random forest - original"
),
list(method = "update",
args = list(list(visible=c(F,F,F,F,F,F,F,F,F,F,F,F,T,F))),
label = "Random forest - SMOTE",
title="Random forest - SMOTE"
),
list(method = "update",
args = list( list(visible=c(F,F,F,F,F,F,F,F,F,F,F,F,F,T))),
label = "Random forest - RSLSho",
title = "Random forest - RSLSho"
)
)
)
))
fig
Figure 2.18: Precision-Recall curves.
By looking at figure 2.17, it is not easy to decide right away which models perform best. However, figure 2.18 highlights that there are in fact significant differences among the classifiers.
By looking at the table above, we can say that SVM is the best-performing classifier, regardless of the kernel or the training set used to train it.
Logistic regression also returns quite satisfactory results in all the performance measures taken into consideration, whereas random forests do not seem to be appropriate for this classification task. As for \(k\)-NN, it performs well only when trained on the original training set.
We can also notice that using SMOTE results in models with a significantly low recall, specifically below \(50\%\). Hence, this technique does not help us achieve our main goal, namely correctly predicting as many positive instances as possible.
Regarding the F1-measure, the best model is the kernel (Gaussian) SVM with RSLSho and DEC. In fact, this model also performs well on precision and recall.
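The reason the PR curves separate the models more clearly than the ROC curves can be seen with a small base-R sketch on a hypothetical imbalanced confusion matrix: a modest number of false positives barely moves the false positive rate, but sharply lowers precision.

```r
# Base-R sketch on made-up counts: 20 positives and 180 negatives, as in an
# imbalanced test set. 20 false positives keep the FPR low but halve precision.
tp <- 15; fn <- 5      # the 20 positive instances
fp <- 20; tn <- 160    # the 180 negative instances
fpr  <- fp / (fp + tn)   # ROC x-axis: 20/180, about 0.11
tpr  <- tp / (tp + fn)   # ROC y-axis (= recall): 0.75
prec <- tp / (tp + fp)   # PR y-axis: 15/35, about 0.43
c(fpr = fpr, tpr = tpr, precision = prec)
```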
Now, it is worth making some observations about the RSLSho technique. During the preliminary analysis, we noticed some noisy observations, which can be detected by looking at the first column of figure 2.3. This is the reason why we decided to perform RSLSho.
By evaluating Cook's distance, we found that these points are also high-leverage points for the logistic regression model, especially observation No. 296.
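Cook's distance can be obtained directly from a fitted glm object; the following is a minimal base-R sketch on synthetic data (the variables and cutoff are illustrative, not the ones used in the report):

```r
# Base-R sketch on synthetic data: cooks.distance() measures how much the
# fitted coefficients change when a single observation is left out.
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(2 * x))
fit <- glm(y ~ x, family = binomial)
cd <- cooks.distance(fit)
which(cd > 4 / length(cd))  # a common rule-of-thumb cutoff for influential points
```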
As can be seen in figure 2.19, this is consistent with the values that we computed for the Mahalanobis distance. This distance measures how far a point lies from the mean of its distribution. It generalizes to multidimensional spaces the idea of measuring how many standard deviations separate two points. In fact, its definition is given by: \[ d_M(x)=\sqrt{(x-\mu)^T \Sigma^{-1}(x-\mu)}, \] where \(\mu\) is the mean and \(\Sigma\) is the covariance matrix of the distribution.
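As a consistency check (base R, synthetic data), note that the built-in mahalanobis() returns the squared distance, so a square root is needed to match \(d_M(x)\):

```r
# Base-R sketch: verify mahalanobis() against the explicit formula.
set.seed(7)
X  <- matrix(rnorm(200), ncol = 2)
mu <- colMeans(X)
S  <- cov(X)
d2 <- mahalanobis(X[1, ], center = mu, cov = S)            # squared distance
d  <- sqrt(t(X[1, ] - mu) %*% solve(S) %*% (X[1, ] - mu))  # d_M(x) by hand
all.equal(sqrt(d2), as.numeric(d))  # TRUE
```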
library(plotly)
fails=climate[climate$outcome=="failure",3:20]
successes=climate[climate$outcome=="success",3:20]
fails$m_dist=mahalanobis(fails,colMeans(fails),cov(fails))
successes$m_dist=mahalanobis(successes,colMeans(successes),cov(successes))
b1=boxplot(fails$m_dist,plot=FALSE)
b2=boxplot(successes$m_dist,plot=FALSE)
OL_fails=fails[fails$m_dist>b1$stats[5],]
OL_non_fails=successes[successes$m_dist>b2$stats[5],]
fig <- plot_ly(y = ~successes$m_dist, type = "box",fillcolor=alpha('green', 0.2),line=list(color="green"),name="Success",boxpoints = 'outliers',
marker = list(color ='green'),
line = list(color = 'green'))
fig <- fig %>% add_trace(y = ~fails$m_dist,fillcolor=alpha('red', 0.2),line=list(color="red"),name="Failure",boxpoints = 'outliers',
marker = list(color ='red'),
line = list(color = 'red'))
fig <- fig %>% layout(
yaxis = list(
title = "Mahalanobis Distance"
)
)
fig
Figure 2.19: Boxplots of the Mahalanobis distance distribution per class.
Again, the point furthest from the centre of the failure-class distribution is
OL_fails # 4 outcasts
vconst_corr vconst_2 vconst_3 vconst_4 vconst_5 vconst_7 ah_corr
296 0.07982585 0.2033237 0.7719705 0.1394137 0.02110685 0.168604 0.5645766
ah_bolus slm_corr efficiency_factor tidal_mix_max vertical_decay_scale
296 0.3264513 0.4127048 0.9387219 0.4439851 0.2089077
convect_corr bckgrnd_vdc1 bckgrnd_vdc_ban bckgrnd_vdc_eq bckgrnd_vdc_psim
296 0.6889979 0.9459164 0.004030131 0.03827201 0.06606623
Prandtl m_dist
296 0.7368638 32.56307
However, the RSLSho technique that we applied to better manage the noise issue does not seem to solve the problem. Looking at the confusion matrices of the Gaussian SVM (or of any other model), we notice that the number of false positives increases after applying the \(1\)-NN correction. In particular, \(1\)-NN trained on the set consisting of the outcasts and the negative training instances classifies the following observations as failures, although three of them (namely No. 362, No. 455 and No. 536) are actually successes:
pred.knn1=knn3Train(train=negOutTrain[,c(1,2,13,14)],test=data.test[,c(1,2,13,14)],cl=negOutTrain$outcome,k=1,prob=TRUE)
data.test[pred.knn1=="failure",]
vconst_corr vconst_2 vconst_3 vconst_4 vconst_5 vconst_7 ah_corr
362 0.3699235 0.9876988 0.1704304 0.9815633 0.06874150 0.2806204 0.6139804
455 0.7902743 0.6854426 0.2889518 0.5118003 0.66125739 0.6028326 0.8912755
526 0.8428196 0.8778395 0.2668091 0.5843633 0.16104902 0.6764960 0.7827759
536 0.6571360 0.4893750 0.1337129 0.4119496 0.08777969 0.3562886 0.4802044
ah_bolus slm_corr efficiency_factor tidal_mix_max vertical_decay_scale
362 0.32123615 0.5432369 0.2836518 0.2618274 0.7372218
455 0.52906164 0.5744786 0.1553106 0.1137250 0.9780016
526 0.30658027 0.6727875 0.8855930 0.2203315 0.3046886
536 0.02967801 0.4001017 0.2805463 0.3841169 0.8859483
convect_corr bckgrnd_vdc1 bckgrnd_vdc_ban bckgrnd_vdc_eq bckgrnd_vdc_psim
362 0.7752597 0.1968281 0.6062493 0.7502562 0.09415387
455 0.8670257 0.5644948 0.3005283 0.6054734 0.15104151
526 0.6552971 0.1538240 0.9449775 0.7364240 0.59670410
536 0.7684816 0.4594791 0.3344821 0.5730016 0.61018319
Prandtl outcome
362 0.6667421 success
455 0.8168709 success
526 0.7139095 failure
536 0.7377059 success
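To isolate the behaviour of this correction step, here is a hedged sketch of knn3Train on synthetic data (assuming the caret package; the variable names are hypothetical, not the report's objects):

```r
# Sketch (caret, synthetic data): with k = 1 each test point inherits the class
# of its single nearest reference point, and attr(pred, "prob") holds the
# per-class neighbour proportions (all 0 or 1 when k = 1).
library(caret)
set.seed(3)
ref     <- data.frame(x1 = runif(30), x2 = runif(30))
ref_cl  <- factor(sample(c("failure", "success"), 30, replace = TRUE))
test_df <- data.frame(x1 = runif(5), x2 = runif(5))
pred <- knn3Train(train = ref, test = test_df, cl = ref_cl, k = 1, prob = TRUE)
pred                # predicted labels for the 5 test points
attr(pred, "prob")  # 5 x 2 matrix of class proportions
```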
In the end, the Gaussian SVM seems to be the best algorithm for predicting crashes in climate models. This is consistent with the results reported in [1].
Moreover, the RSLS technique is useful because, unlike SMOTE, it allows us to detect the outcasts and treat them separately. However, the results we obtained from handling the outcasts through its RSLSho variant are quite unsatisfactory. This suggests that it would be worthwhile to gain a deeper understanding of the nature of these points. Nevertheless, we do not know whether using other techniques could improve model training, since the outcasts represent a significant portion of the positive cases:
cat(nrow(outcast)/nrow(data.train[data.train$outcome=="failure",])*100,'%')
25 %
Therefore, it is really difficult to find an underlying pattern in the failure points that would allow us to detect them all.
[1] Lucas D D, Klein R, Tannahill J, Ivanova D, Brandon S, Domyancic D and Zhang Y 2013 Failure analysis of parameter-induced simulation crashes in climate models Geoscientific Model Development 6 1157–71
[2] Weiss G M 2013 Foundations of imbalanced learning. In: Imbalanced Learning: Foundations, Algorithms, and Applications 13–41
[3] Siriseriwan W and Sinapiromsaran K 2016 The effective redistribution for imbalance dataset: relocating Safe-Level SMOTE with minority outcast handling Chiang Mai Journal of Science 1 234–46
[4] Fernández A, García S, Galar M and Herrera F 2018 Learning from Imbalanced Data Sets, chapter 4